Retrieving all network requests required to load a webpage using Python

Let's say I am making a Python request:
import requests
url = "https://www.google.com"
r = requests.get(url)
Is there any method for getting all the network requests needed to load such a website, for example those listed in the inspect element tool in Chrome? I believe that I could achieve the same effect using Selenium, but is there any library or method that simply returns all the network requests and responses made when requesting a URL?

Selenium Wire may be worth a try. I haven't been able to find much else in this space either.
https://github.com/wkeeling/selenium-wire
Selenium Wire extends Selenium's Python bindings to give you access to the underlying requests made by the browser. You author your code in the same way as you do with Selenium, but you get extra APIs for inspecting requests and responses and making changes to them on the fly.
This article describes more HTTP Request packages that may have similar capabilities or related extensions.
https://www.twilio.com/blog/5-ways-http-requests-python
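A minimal sketch of what that could look like with Selenium Wire, assuming the package is installed and a local Chrome is available. The helper names summarize_requests and capture_requests are made up for illustration; driver.requests and the request/response attributes are the ones Selenium Wire documents.

```python
def summarize_requests(captured):
    """Return (method, url, status) for each captured request.

    `captured` is any iterable of objects shaped like Selenium Wire's
    request objects: .method, .url and .response (None if unanswered).
    """
    rows = []
    for request in captured:
        response = request.response
        status = response.status_code if response is not None else None
        rows.append((request.method, request.url, status))
    return rows


def capture_requests(url):
    """Load `url` in Chrome via Selenium Wire and summarize its traffic."""
    # Imported lazily so summarize_requests works without the package.
    from seleniumwire import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return summarize_requests(driver.requests)
    finally:
        driver.quit()


# Example (needs selenium-wire, Chrome and network access):
# for method, url, status in capture_requests("https://www.google.com"):
#     print(status, method, url)
```

Requests that are still in flight when the page load returns may not have a response yet, which is why the status can be None.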

Related

Requests: how to use two different websites and switch between them?

Hello, is there a way to use two different website URLs and switch between them?
I mean, I have two different websites like this:
import requests
session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! im in first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! im in second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
but I want to do it in Requests, so is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium automates a real browser (optionally in headless mode), so it can simulate what a user does. Requests is a Python library which simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve JavaScript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON formatted data (with no HTML styling or page rendering at all). With Requests, after the HTTP request is made, the response is saved in an object and there is no open page to go back to (a Session may keep the underlying connection alive for reuse, but that is only a transport detail).
Selenium lets you traverse complex, JavaScript-rendered menus and the like, since each page is actually built (under the hood) by a real browser as if it were being displayed to a user. Selenium drives everything that your browser does, including the HTTP requests that Requests is built to perform; in headless mode only the on-screen display is skipped. After loading a page with Selenium, the browser session remains open. This allows you to navigate through a complex site, whereas with Requests you would need the full URL of the final page.
Because of this distinction, it makes sense that Selenium has a switch_to.window method but Requests does not. The way your code is written, you can access the responses to the HTTP GET calls you've made directly through your variables (firstPage contains the response from stackoverflow, secondPage contains the response from youtube). While using Requests, you are never "in" a page in the sense that you can be with Selenium, since it is an HTTP library and not a browser.
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.
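To illustrate the point that with Requests there is nothing to switch between, here is a small sketch. The helper fetch_pages is a made-up name; it only assumes an object with a get(url, timeout=...) method, which requests.Session provides.

```python
def fetch_pages(session, urls):
    """GET each URL with one session and return {url: response}."""
    return {url: session.get(url, timeout=10) for url in urls}


# Example (needs the requests package and network access):
# import requests
# with requests.Session() as session:
#     pages = fetch_pages(session, ["https://stackoverflow.com", "https://youtube.com"])
#     first_page = pages["https://stackoverflow.com"]
#     second_page = pages["https://youtube.com"]
#     print(first_page.status_code, second_page.status_code)
```

Both responses stay available at the same time, so "going back" to the first site just means reading first_page again or issuing another session.get.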

BeautifulSoup request is returning an empty list from LinkedIn.com/jobs

I'm new to BeautifulSoup and web scraping, so please bear with me.
I'm using BeautifulSoup to pull all job post cards from LinkedIn with the title "Security Engineer". After using inspect element on https://www.linkedin.com/jobs/search/?keywords=security%20engineer on an individual job post card, I believe I have found the correct 'li' element and its class. The code runs, but it returns an empty list '[ ]'. I don't want to use any APIs because this is an exercise for me to learn web scraping. Thank you for your help. Here's my code so far:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer').text
soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('li', class_ = "jobs-search-results__list-item occludable-update p0 relative ember-view")
print(jobs)
As @baduker mentioned, using plain requests won't do all the heavy lifting that browsers do.
Whenever you open a page in your browser, the browser renders the visuals, makes extra network calls, and runs JavaScript. The first thing it does is load the initial response, which is what you're doing with requests.get('https://www.linkedin.com/jobs/search/?keywords=security%20engineer')
The page you see in your browser is the result of many, many more requests.
Your list is empty because the HTML you get back is very minimal. You can print it out to the console and compare it to the browser's.
To make things easier, instead of using requests you can use Selenium which is essentially a library for programmatically controlling a browser. Selenium will do all those requests for you like a normal browser and let you access the page-source as you were expecting it to look.
This is a good place to start, but your scraper will be slow. There are things you can do in Selenium to speed things up, like running in headless mode, which means the page is not rendered graphically, but it still won't be as fast as figuring out how to do it on your own with requests.
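As a rough sketch of both ideas: count_elements checks how many matching elements the raw HTML actually contains, and rendered_html fetches the page source through a headless Chrome. Both function names are invented for this example, and the "--headless=new" flag assumes a reasonably recent Chrome (older versions use "--headless").

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one kind that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html, tag, css_class):
    """Return how many <tag> elements in `html` have `css_class`."""
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    return parser.count


def rendered_html(url):
    """Return the page source after a headless Chrome has loaded `url`."""
    from selenium import webdriver  # lazy: only needed for this function

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Example (needs requests, selenium, Chrome and network access):
# raw = requests.get(url).text
# print(count_elements(raw, "li", "jobs-search-results__list-item"))
# print(count_elements(rendered_html(url), "li", "jobs-search-results__list-item"))
```

If the first count is 0 and the second is not, the cards are being added by JavaScript after the initial response.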
If you want to do it using requests, you're going to need to do a lot of snooping through the requests the browser makes, maybe using a tool like Postman, and work out how to simulate the necessary steps to get the data from whatever page.
For example some websites have a handshake process when logging in.
A website I've worked on goes like this:
1. (Step 0 really) Set up request headers, because the site doesn't seem to respond unless a User-Agent header is included
2. Fetch the initial HTML and get a unique key from a hidden element in a <form>
3. Using this key, make a POST request to the URL from that form
4. Get a session id key from the response
5. Set up another POST request that combines username, password, and session id. The URL was in some JavaScript function, but I found it using the network inspector in the devtools
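The steps above could be sketched roughly as follows. Everything site-specific here is hypothetical: the URLs, the form field names, and especially the assumption that the session id comes back as JSON under "session_id". The session argument is expected to behave like requests.Session.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormParser(HTMLParser):
    """Collect the first form's action URL and every hidden input."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""


def parse_form(html):
    """Return (form action, {hidden input name: value}) for `html`."""
    parser = FormParser()
    parser.feed(html)
    return parser.action, parser.hidden


def log_in(session, start_url, login_url, username, password):
    """Walk the handshake with a requests.Session-like object."""
    # Set a User-Agent first; the site ignores requests without one.
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    # Fetch the initial HTML and read the hidden key from the form.
    action, hidden = parse_form(session.get(start_url, timeout=10).text)
    # POST the key to the form's URL and read the session id.
    # Assumption: the id is returned as JSON under "session_id".
    first = session.post(urljoin(start_url, action or ""), data=hidden, timeout=10)
    session_id = first.json()["session_id"]
    # Final POST combining username, password and session id.
    payload = {"username": username, "password": password, "session_id": session_id}
    return session.post(login_url, data=payload, timeout=10)
```

On a real site each of these details has to be read off the network inspector, since no two login flows are quite the same.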
So really, I work strictly with Selenium if it's too complicated and I'm only getting the data once or not so often. I'll go through the heavy stuff if I'm building a scraper for an API that others will use frequently.
Hope any of this made sense to you. Happy scraping!

How to surf the web without cookies from code

I was trying to scrape some links from the web via Google search.
Let's say my query is [games site:pastebin.com].
I tried this in both Python and Dart, but the result I got was that I need to log in, and I don't want to use cookies.
So, is there any way to get the result of https://www.google.com/search?q=site%3Apastebin.com+games from code block without cookies?
The Code I Tried:
Python 3.9.5
import requests
req = requests.get("https://www.google.com/search?q=games+site%3Apastebin.com")
That fully depends on the website you are trying to access. Some pages won't allow you to use certain features from their page without cookies at all, some do. For the purpose you are trying to achieve, I'd rather recommend using a search API, which doesn't require cookies - since cookies are normally for regular users.
As far as I know, Google usually doesn't like it when you scrape their content using scripts.
As mentioned before, you can look for alternative search engines which don't require cookie usage.

Is there any other way to extract data from dynamic website, rather than using selenium?

I am trying to extract data from the website https://shop.nordstrom.com/ for all the products (shirts, t-shirts and so on). The page is dynamically loaded. I know I can use Selenium with a headless browser, but that is a time-consuming process, and locating elements that have strange ID and class names is not very promising either.
So I thought of looking in the Network tool to see if I can find the path to the API from which the data is loaded (XHR request), but I could not find anything helpful. So is there a way to get the data from the website?
If you don't want to use Selenium, the alternative is to fetch pages with the requests module and parse them with a parser like bs4.
You are on the right path in looking for the call to the API. XHR requests can be seen under the Network tab, but the multitude of resources that appear there makes it hard to pick out the requests being made. A simple way around this is the following:
Instead of the Network tab, go to the Console tab. There, click on the settings icon and tick just the option Log XMLHttpRequests.
Now refresh the page and scroll down to trigger the dynamic calls. You will now see all XHR calls logged much more clearly.
For example
(index):29 Fetch finished loading: GET "https://shop.nordstrom.com/api/recs?page_type=home&placement=HP_SALE%2CHP_TOP_RECS%2CHP_CUST_HIS%2CHP_AFF_BRAND%2CHP_FTR&channel=web&bound=24%2C24%2C24%2C24%2C6&apikey=9df15975b8cb98f775942f3b0d614157&session_id=0&shopper_id=df0fdb2bb2cf4965a344452cb42ce560&country_code=US&experiment_id=945b2363-c75d-4950-b255-194803a3ee2a&category_id=2375500&style_id=0%2C0%2C0%2C0&ts=1593768329863&url=https%3A%2F%2Fshop.nordstrom.com%2F&zip_code=null".
Making a GET request to that URL returns a bunch of JSON objects. You can now request this URL, and others that you can derive from it, directly.
See the answer here on how you can integrate the URL with the requests module to fetch data.
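A minimal sketch of calling such an endpoint directly. The endpoint path and parameter names are taken from the logged URL above; which of them are actually required, and whether the endpoint still responds, is unknown, so treat the parameter subset as an assumption. build_api_url and fetch_json are made-up helper names.

```python
from urllib.parse import urlencode


def build_api_url(base, params):
    """Append URL-encoded query parameters to an endpoint."""
    return base + "?" + urlencode(params)


def fetch_json(url):
    """GET `url` and decode the JSON body (needs requests and network)."""
    import requests  # lazy: build_api_url works without it

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    return response.json()


# Example (assumption: this subset of parameters is enough):
# url = build_api_url(
#     "https://shop.nordstrom.com/api/recs",
#     {"page_type": "home", "placement": "HP_SALE,HP_TOP_RECS", "channel": "web"},
# )
# data = fetch_json(url)
```

Building the query from a dict makes it easy to vary one parameter at a time and see which ones the server really needs.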

Different Twitter HTML structure for browsers and python web opener

I'm working on a script which downloads some data from Twitter profiles. I found out that the HTML structure is different in a web browser than for a Python "robot": when I open the page through Python's urllib2 and BeautifulSoup, I get different tag IDs and classes. Is there a way to get the same content as in a web browser?
I need it for resolving short URLs, because in a web browser the resolved URLs are stored in the link's title attribute.
Many websites adapt their response according to the User-Agent header on the request. If it is missing or left at a library default (urllib2 sends Python-urllib/<version>), it is obvious that this is not a browser, but some sort of script. You'll probably want to set a User-Agent header that is somewhat similar to a "real" browser.
Lots of methods to do this are described here: Changing user agent on urllib2.urlopen and here: Fetch a Wikipedia article with Python
On an unrelated note, you might want to use Requests, which is a much better API than the standard urllib2.
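A small sketch of setting the header, using urllib.request (the Python 3 successor of urllib2). The User-Agent string is just an example of a browser-like value, and the function names are invented for illustration.

```python
from urllib.request import Request, urlopen

# Example value only; any current browser's User-Agent string would do.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def browser_like_request(url, user_agent=BROWSER_UA):
    """Build a request that announces a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Fetch `url` with the browser-like header (needs network access)."""
    with urlopen(browser_like_request(url), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With Requests the equivalent is passing headers={"User-Agent": ...} to requests.get.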
Don't screen scrape for Twitter profile information. Use the API. Your whole program will be much more robust. Changing your user agent and similar tricks are probably against their TOS too.
