Python web automation: HTTP requests or headless browser?

I am confused on this particular topic. I built bots for two different websites using Python's requests module to manually simulate sending HTTP POST and GET requests.
I set SOCKS proxies and user agents in my requests, as well as referrer URLs when necessary (I verified the actual requests a browser sends on these sites using Burp Suite), in order to make the traffic look genuine.
However, any accounts I run through my bots keep getting suspended, which got me wondering what I'm doing wrong. A friend suggested that maybe I should use a headless solution (PhantomJS), and I am leaning towards that route, but I am still confused and would like to know what the difference is between using the HTTP requests module and using a headless browser like PhantomJS.
I am not sure if there is any need to paste my source code here; I'm just looking for some direction on this project. Thank you for taking your time to read such a long wall of text :)

Probably, you have to set cookies.
To make your requests look more genuine, you should set other headers such as Host and Referer. The Cookie header, however, should change every time; you can obtain the cookies this way:
from requests import Session

with Session() as session:
    # First request: grab the cookies the server sets.
    response = session.get('your_url', headers=your_headers, proxies=proxies)  # optionally add the params keyword
    cookies = response.cookies.get_dict()
    # Second request: send those cookies back explicitly.
    response = session.get('your_url', headers=your_headers, cookies=cookies, proxies=proxies)
Or maybe, the site is scanning for bots in some way.
In this case, you could try adding a delay between requests with time.sleep(). You can see the timings in your browser's Dev Tools. Alternatively, you could emulate all the requests your browser sends when it connects to the site, such as AJAX calls.
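A minimal sketch of that pacing idea, with placeholder URLs and headers (adjust the delay range to what you observe in Dev Tools):

import random
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; copy the real browser headers
urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

session = requests.Session()
for url in urls:
    response = session.get(url, headers=headers)
    # Pause a randomized interval between requests instead of firing them back-to-back.
    time.sleep(random.uniform(2.0, 5.0))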
In my experience, using requests or using Selenium WebDriver doesn't make much difference in terms of detection, because with plain Selenium you can't access or modify the headers, or even the request and response data. Also, note that PhantomJS is no longer maintained; headless Chrome is preferred instead.
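For example, a minimal sketch of driving headless Chrome with Selenium (assumes Selenium 4 and a matching ChromeDriver on the PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # older Chrome/Selenium versions use '--headless'
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com')
    print(driver.title)  # the page is rendered, JavaScript included, but no window is shown
finally:
    driver.quit()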
If none of the requests-based approaches work, I suggest using Selenium Wire or Mobilenium, extended versions of Selenium that give you access to request and response data.
Hope it helps.

Related

Requests: how to use two different websites and switch between them?

Hello, is there a way to use two different website URLs and switch between them?
I mean I have two different websites, like:
import requests
session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! I'm on the first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! I'm on the second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
but I want to do it in Requests, so is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium is a browser automation tool that drives a real (optionally headless) browser and fully simulates a user. Requests is a Python library that simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve JavaScript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON-formatted data (with no HTML styling or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in an object and the connection is closed.
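For instance, a minimal sketch of the kind of JSON-API scraping Requests is well suited for (httpbin.org is used here as a stand-in API):

import requests

response = requests.get('https://httpbin.org/get', params={'q': 'example'})
data = response.json()  # parsed JSON; no HTML rendering involved
print(data['args'])     # {'q': 'example'}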
Selenium allows you to traverse complex, JavaScript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium encapsulates everything that your browser does except displaying the HTML (including the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the connection remains open. This allows you to navigate through a complex site where, with Requests, you would need to already know the full URL of the final page.
Because of this distinction, it makes sense that Selenium would have a switch_to.window method but Requests would not. The way your code is written, you can access the responses to the HTTP GET calls you've made directly through your variables (firstPage contains the response from Stack Overflow, secondPage contains the response from YouTube). While using Requests, you are never "in" a page in the sense that you can be with Selenium, since it is an HTTP library and not a full headless browser.
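In other words, "switching" with Requests is just a matter of choosing which response object to read, roughly like this:

import requests

session = requests.Session()
firstPage = session.get('https://stackoverflow.com')
secondPage = session.get('https://youtube.com')

# There is no "current page" to switch to; each response is simply
# available through its own variable.
print(firstPage.status_code, len(firstPage.text))
print(secondPage.status_code, len(secondPage.text))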
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.

Retrieving all network requests required to load a webpage using python

Let's say I am making a Python request:
import requests
url = "https://www.google.com"
r = requests.get(url)
Is there any method for getting all the network requests needed to load such a website, for example those listed in Chrome's developer tools? I believe I could achieve the same effect using Selenium, but is there any library or method I could use to simply get all the network requests/responses made when requesting a URL?
Selenium Wire may be worth a try. I haven't been able to find much else in this space either.
https://github.com/wkeeling/selenium-wire
Selenium Wire extends Selenium's Python bindings to give you access to the underlying requests made by the browser. You author your code in the same way as you do with Selenium, but you get extra APIs for inspecting requests and responses and making changes to them on the fly.
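A minimal sketch of what that looks like (assumes selenium-wire is installed and a ChromeDriver is available):

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
try:
    driver.get('https://www.google.com')
    # driver.requests holds every request the browser issued while loading the page.
    for request in driver.requests:
        if request.response:
            print(request.url, request.response.status_code,
                  request.response.headers['Content-Type'])
finally:
    driver.quit()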
This article describes more HTTP Request packages that may have similar capabilities or related extensions.
https://www.twilio.com/blog/5-ways-http-requests-python

How to access the User-Agent in Python to get my browser details; every method I try returns Python-urllib/2.7

I want to access the browser name and version in Python by sending out a request. Is this the ideal method, or is there another way? All the methods that provide user agents report Python-urllib/2.7 as the user agent; I want my actual browser's user agent.
I'll assume you're familiar with HTTP requests and their structure; if not, here's a link to the RFC documentation for HTTP/1.1 requests; at the bottom of that page there is a list of links to the header fields.
The user-agent is a field in the HTTP request header that identifies the entity that sends the request; by entity I mean the program, hosted on your machine, that you used to send the request. Things like the browser type, version and operating system are sent in the user-agent field.
So, when you use urllib.request to send a request, urllib fills the HTTP request headers with the values you provide; otherwise, default values are used. That's why you get a default like Python-urllib/2.7 as the user-agent.
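A minimal sketch of the default behaviour and how to override it (Python 3 urllib shown; https://httpbin.org/user-agent simply echoes the header back):

import urllib.request

url = 'https://httpbin.org/user-agent'

# Default: urllib identifies itself, e.g. "Python-urllib/3.x".
print(urllib.request.urlopen(url).read())

# Override the header with a browser-like value (example string, not your real one).
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})
print(urllib.request.urlopen(req).read())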
If you need the user-agent of a specific browser, you need to send the request using that browser. You can do that in Python with a browser automation tool like Selenium WebDriver, which you can use to launch an instance of your browser and visit websites.
I've only worked with Selenium WebDriver, and it doesn't have the capability to inspect sent/received packets/requests; in other words, you can't get the HTTP requests/responses directly from Selenium.
As a workaround, you can use Selenium (or any other automation tool) to launch your browser, then go to a website that will show you your user-agent, and may even parse it for you.
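A variation on that workaround that skips the external site is to ask the browser for its own user agent via JavaScript. A minimal sketch with Selenium and Chrome:

from selenium import webdriver

driver = webdriver.Chrome()
try:
    # The browser reports its own User-Agent string, so no external site is needed.
    print(driver.execute_script('return navigator.userAgent'))
finally:
    driver.quit()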
Here's a link to selenium documentation, it explains how to get started with selenium and how to download the required packages.
If you search on Google for your user agent online, Google will tell you what your user agent is.

Trying to mimic browser website requests

I'm trying to request the HTML source from websites that check whether or not the request was sent from a browser (Chrome, for example). Does anybody know how I can capture all the requests my computer is making, along with the applications sending them? I'm just trying to see the truth of what my computer is sending, without the possibility of anything being filtered.
Selenium WebDriver is a good choice if you are sending the requests from a system that has a user interface.
Otherwise, if you are using requests, try setting the User-Agent and the other required headers along with the request.
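For the requests route, a minimal sketch of setting browser-like headers (the values here are examples; copy the ones your own browser actually sends, as seen in Dev Tools or a proxy):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # example value
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
}
response = requests.get('https://example.com', headers=headers)
print(response.status_code)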

Logging into instagram using Python requests

I'm trying to log into Instagram using Python Requests. I figured it would be as simple as creating a requests.Session object and then sending a POST request, i.e.
session.post(login_url, data={'username':****, 'password':****})
This didn't work. I didn't know why, so I tried manually entering the browser's headers (I used Chrome Dev Tools to see the headers of the POST request) and passing them along with the request (headers={...}), even though I figured the session would deal with that. I also tried sending a GET request to the login URL first in order to get a cookie (and a CSRF token, I think), then doing the steps mentioned before. None of this worked.
I don't have much experience at all with this type of thing, and I just don't understand what differentiates my POST requests from Google Chrome's (I must be doing something wrong). Thanks.
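For reference, a generic sketch of the GET-then-POST flow described above. The login URL, the csrftoken cookie name, and the X-CSRFToken/Referer headers are assumptions that vary by site, and Instagram applies additional protections beyond this:

import requests

login_url = 'https://example.com/accounts/login/'  # placeholder URL

with requests.Session() as session:
    # First GET: the session stores whatever cookies the server sets,
    # often including a CSRF token.
    session.get(login_url)
    csrf_token = session.cookies.get('csrftoken')  # assumed cookie name

    # Then POST the credentials, echoing the token back where the site expects it.
    response = session.post(
        login_url,
        data={'username': 'user', 'password': 'pass'},
        headers={'X-CSRFToken': csrf_token, 'Referer': login_url},  # assumed header names
    )
    print(response.status_code)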
