I'm trying to request the HTML source of websites that check whether the request was sent from a real browser (Chrome, for example). Does anybody know how I can capture all the requests my computer is making, along with the applications sending them? I'm just trying to see exactly what my computer is sending, without the possibility of anything being filtered.
Selenium WebDriver is a good choice if you are sending the requests from a system that has a user interface.
Otherwise, if you are using requests, try setting the User-Agent and the other required headers on the request.
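For example, here is a minimal sketch of sending a browser-like User-Agent with requests. The UA string is only an illustrative placeholder; copy the exact headers your browser sends from its developer tools:
import requests

headers = {
    # Illustrative placeholder; use the exact User-Agent string your browser sends.
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',
    'Accept-Language': 'en-US,en;q=0.9',
}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.text)  # httpbin echoes back the headers it received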
Hello, is there a way to use two different website URLs and switch between them?
I mean, I have two different websites, like:
import requests
session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! I'm on the first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! I'm on the second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
but I want to do it with Requests. Is there a way?
Selenium and Requests are two fundamentally different tools. Selenium is a browser automation framework that drives a real browser (optionally headless) and fully simulates a user. Requests is a Python library which simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve JavaScript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON-formatted data (with no HTML styling or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in an object and the connection is closed.
Selenium allows you to traverse complex, JavaScript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium encapsulates everything that your browser does except displaying the HTML (including the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the connection remains open. This allows you to navigate through a complex site where, with Requests, you would need the full URL of the final page.
Because of this distinction, it makes sense that Selenium would have a switch_to.window method but Requests would not. The way your code is written, you can access the responses to the HTTP GET calls you've made directly through your variables (firstPage contains the response from Stack Overflow, secondPage contains the response from YouTube). While using Requests, you are never "in" a page in the sense that you can be with Selenium, since it is an HTTP library and not a browser.
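To illustrate, with the code from the question there is nothing to switch to; both responses are already in hand as ordinary objects:
# Continuing the snippet from the question: both responses exist at the
# same time, as plain objects, with no notion of an "active" page.
print(firstPage.status_code)   # e.g. 200
print(firstPage.text[:100])    # first 100 characters of the Stack Overflow HTML
print(secondPage.url)          # final URL after any redirects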
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.
I am confused on this particular topic. I built a bot for two different websites using Python's requests module to manually simulate sending HTTP POST and GET requests.
I implemented SOCKS proxies and set user agents on my requests, as well as referrer URLs when necessary (I verified the actual requests a browser sends on these sites using Burp Suite), in order to make them look genuine.
However, any accounts I run through my bots keep getting suspended. It got me wondering what I'm doing wrong. A friend suggested that maybe I should use one of the headless solutions (PhantomJS), and I am leaning towards that route, but I am still confused and would like to know what the difference is between using the HTTP requests module and using a headless browser like PhantomJS.
I am not sure if there is any need to paste my source code here. Just looking for some direction on this project. Thank you for taking the time to read such a long wall of text :)
Probably, you have to set cookies.
To make your requests look more genuine, you should set other headers such as Host and Referer. The Cookie header, however, changes every time. You can get the cookies this way:
from requests import Session

with Session() as session:
    # First request: let the server set its cookies.
    response = session.get('your_url', headers=your_headers, proxies=proxies)  # add a params keyword if needed
    cookies = response.cookies.get_dict()
    # Follow-up request: send those cookies back explicitly.
    response = session.get('your_url', headers=your_headers, cookies=cookies, proxies=proxies)
Or maybe the site is scanning for bots in some way.
In this case, you could try adding a delay between requests with time.sleep(). You can see the timings in your browser's Dev Tools. Alternatively, you could emulate all the requests your browser sends when you connect to the site, such as AJAX calls.
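A minimal sketch of spacing out requests; the 2-second pause and the URLs are arbitrary examples, so pick timings close to what you observe in Dev Tools:
import time
import requests

session = requests.Session()
for url in ['https://example.com/page1', 'https://example.com/page2']:  # hypothetical URLs
    response = session.get(url)
    print(response.status_code)
    time.sleep(2)  # pause between requests to mimic human pacing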
In my experience, using requests or using Selenium WebDriver doesn't make much difference in terms of detection, because with plain Selenium you can't access or modify the request headers, nor inspect the request and response data. Also, note that PhantomJS is no longer maintained; headless Chrome is preferred instead.
If none of the requests-based approaches work, I suggest using Selenium Wire or Mobilenium, modified versions of Selenium that allow access to request and response data.
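For example, a rough sketch with Selenium Wire (installed with pip install selenium-wire; assumes Chrome and a matching chromedriver are available), whose drop-in webdriver records the browser's traffic:
from seleniumwire import webdriver  # note: seleniumwire, not selenium

driver = webdriver.Chrome()
try:
    driver.get('https://example.com')
    # Every request the browser made is captured on driver.requests.
    for request in driver.requests:
        if request.response:
            print(request.url, request.response.status_code)
finally:
    driver.quit()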
Hope it helps.
I want to get the browser name and version in Python by sending out a request. Is this the ideal method, or is there another way? All the methods that provide user agents give Python-urllib/2.7 as the user agent; I want my actual browser's user agent.
I'll assume you're familiar with HTTP requests and their structure; if not, here's a link to the RFC documentation for HTTP/1.1 requests. At the bottom of the page there is a list of links to the header fields.
The user-agent is a field in the HTTP request header that identifies the entity that sends the request; by entity I mean the program, hosted on your machine, that you used to send the request. Things like the browser type, version, and operating system are sent in the user-agent field.
So, when you use urllib.request to send a request, urllib fills the HTTP request headers with the values you provide to it; otherwise, default values are used. That's why you get Python-urllib/2.7 as the user agent.
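For instance, a minimal sketch of overriding that default with urllib.request; the UA string is only an illustrative placeholder:
import urllib.request

req = urllib.request.Request(
    'https://httpbin.org/user-agent',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0'},  # placeholder UA
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # httpbin echoes back the User-Agent it received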
If you need the user agent of a specific browser, you need to send the request using that browser. You can do that in Python with a browser automation tool like Selenium WebDriver, which you can use to launch an instance of your browser and navigate to websites.
I've worked only with Selenium WebDriver, and it doesn't have the capability to inspect sent/received packets/requests; in other words, you can't get the HTTP requests/responses directly from Selenium.
As a workaround, you can use Selenium (or any other automation tool) to launch your browser, then go to a website that reports your user agent, and may even parse it for you.
Here's a link to the Selenium documentation; it explains how to get started with Selenium and how to download the required packages.
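A lighter variant of that workaround, as a rough sketch: the user agent is exposed to JavaScript as navigator.userAgent, so you can ask Selenium to evaluate it directly instead of scraping a third-party site (this assumes Chrome and a matching chromedriver are installed):
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
try:
    driver.get('https://example.com')  # any page works; we only need a JavaScript context
    print(driver.execute_script('return navigator.userAgent'))  # the launched browser's real user agent
finally:
    driver.quit()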
If you search on Google for your user agent online, Google will tell you what your user agent is.
I want to write a Python script for a website that requires a login to enable some features, and I want to find out what I need to put in the headers of my script's requests (e.g. an authentication token and other parameters) so they are executed the same way as requests made through the browser.
Does Wireshark help with this if the website uses HTTPS?
Or is my only option executing a browser script with Selenium after a manual login?
For anyone else with the same issue: you don't need to capture your traffic from outside the browser. Just:
- use Google Chrome
- open the developer tools
- click on the Network tab
- clear the data
- do a request in the tab while the dev tools are open
You should see the initial request at the top, followed by subsequent ones (advertising, external image servers, etc.). You can right-click the initial request, save it as a .har file, and use something like https://toolbox.googleapps.com/apps/har_analyzer/ to extract the headers of both the request and the response.
Now you know what parameters (key and value) you need in your headers, and you can even use submitted values like tokens and cookies in your Python script.
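To make that concrete, here is a minimal sketch of replaying such a request with the requests library. The header names are standard, but every value below is a hypothetical placeholder to be copied out of your .har analysis:
import requests

# All values here are hypothetical placeholders; copy the real ones from the .har analysis.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',
    'Authorization': 'Bearer <token-from-har>',
}
cookies = {'sessionid': '<cookie-from-har>'}

response = requests.get('https://example.com/protected', headers=headers, cookies=cookies)
print(response.status_code)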
I am trying to log into a website with the requests module. The username field has no "name" attribute, nor does the password field. However, that is what requests needs to know. So what should I do to log in?
The website is:
https://passport.lagou.com/login/login.html
Thank you!
Open your favorite browser's developer tools (F12 on Chrome, Ctrl+Shift+I on Firefox, etc.), and reproduce the HTTP request displayed in the Network tab when you try to log in.
In your case, some parameters like username and password are being sent as Form Data in a POST request to https://passport.lagou.com/login/login.json
Depending on the web application's implementation, you might also need to send some request headers, and it could also simply not work at all for various reasons.
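As a rough sketch, assuming the form fields really are named username and password (the actual field names, any hidden fields, and possible client-side hashing of the password must be read from the Network tab):
import requests

session = requests.Session()
# Hypothetical field names; copy the real ones from the Network tab.
payload = {'username': 'your_username', 'password': 'your_password'}
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',
    'Referer': 'https://passport.lagou.com/login/login.html',
}
response = session.post('https://passport.lagou.com/login/login.json',
                        data=payload, headers=headers)
print(response.status_code, response.text[:200])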