Context
I am currently attempting to build a small-scale bot using Selenium and Requests module in Python.
However, the webpage I want to interact with is running behind Cloudflare.
My python script is running over Tor using stem module.
My traffic analysis is based on Firefox's "Developer options->Network" using Persist Logs.
My findings so far:
Selenium's Firefox webdriver can often access the webpage without going through "checking browser page" (return code 503) and "captcha page" (return code 403).
Requests session object with the same user agent always results in "captcha page" (return code 403).
If Cloudflare was checking my Javascript functionality, shouldn't my requests module return 503 ?
Code Example
driver = webdriver.Firefox(firefox_profile=fp, options=fOptions)
driver.get("https://www.cloudflare.com") # usually returns code 200 without verifying the browser
session = requests.Session()
# ... applied socks5 proxy for both http and https ... #
session.headers.update({"user-agent": driver.execute_script("return navigator.userAgent;")})
page = session.get("https://www.cloudflare.com")
print(page.status_code) # return code 403
print(page.text) # returns "captcha page"
Both Selenium and Requests modules are using the same user agent and ip.
Both are using GET without any parameters.
How does Cloudflare distinguish these traffic?
Am I missing something?
I tried to transfer cookies from the webdriver to the requests session to see if a bypass is possible but had no luck.
Here is the used code:
for c in driver.get_cookies():
session.cookies.set(c['name'], c['value'], domain=c['domain'])
There are additional JavaScript APIs exposed to the webpage when using Selenium. If you can disable them, you may be able to fix the problem.
Cloudflare doesn't only check HTTP headers or javascript — it also analyses the TLS header. I'm not sure exactly how it does it, but I've found that it can be circumvented by using NSS instead of OpenSSL (though it's not well integrated into Requests).
The captcha response depends on the browser fingerprint. It's not about just sending Cookies and User-agent.
Copy all the headers from Network Tab in Developers console, and send all the key value pairs as headers in request library.
This method should work logically.
Related
I'm not sure how else to describe this. I'm trying to log into a website using the requests library with Python but it doesn't seem to be capturing all cookies from when I login and subsequent requests to the site go back to the login page.
The code I'm using is as follows: (with redactions)
with requests.Session() as s:
r = s.post('https://www.website.co.uk/login', data={
'amember_login': 'username',
'amember_password': 'password'
})
Looking at the developer tools in Chrome. I see the following:
After checking r.cookies it seems only that PHPSESSID was captured there's no sign of the amember_nr cookie.
The value in PyCharm only shows:
{RequestsCookieJar: 1}<RequestsCookieJar[<Cookie PHPSESSID=kjlb0a33jm65o1sjh25ahb23j4 for .website.co.uk/>]>
Why does this code fail to save 'amember_nr' and is there any way to retrieve it?
SOLUTION:
It appears the only way I can get this code to work properly is using Selenium, selecting the elements on the page and automating the typing/clicking. The following code produces the desired result.
from seleniumrequests import Chrome
driver = Chrome()
driver.get('http://www.website.co.uk')
username = driver.find_element_by_xpath("//input[#name='amember_login']")
password = driver.find_element_by_xpath("//input[#name='amember_pass']")
username.send_keys("username")
password.send_keys("password")
driver.find_element_by_xpath("//input[#type='submit']").click() # Page is logged in and all relevant cookies saved.
You can try this:
with requests.Session() as s:
s.get('https://www.website.co.uk/login')
r = s.post('https://www.website.co.uk/login', data={
'amember_login': 'username',
'amember_password': 'password'
})
The get request will set the required cookies.
FYI I would use something like BurpSuite to capture ALL the data being sent to the server and sort out what headers etc are required ... sometimes people/servers to referrer checking, set cookies via JAVA or wonky scripting, even seen java obfuscation and blocking of agent tags not in whitelist etc... it's likely something the headers that the server is missing to give you the cookie.
Also you can have Python use burp as a proxy so you can see exactly what gets sent to the server and the response.
https://github.com/freeload101/Python/blob/master/CS_HIDE/CS_HIDE.py (proxy support )
How to get Authorization token from a webpage using python requests, i have used requests basicAuth to login, it was worked, but subsequent pages are not accpting te basicAuth, it returns "Authuser is not validated"
There is a login url where i have successfully logged in using python requests's basicAuth. then succeeding pages didn't accept basicAuth credential but it needed authorization header. after looking into browser inspect tool, found out that, this authorization header's value is generated as a part of session local storage. is there any way to get this session value without using webdriver API?
Sounds like what you need is a requests persistent session
import requests
s=requests.Session()
#then simply make the request like you already are
r=s.get(r'https://stackoverflow.com/')
#the cookies are persisted
s.cookies.get_dict()
>{'prov':......}
i can't really get more specific without more info about the site you're using.
I am looking to open a connection with python to http://www.horseandcountry.tv which takes my login parameters via the POST method. I would like to open a connection to this website in order to scrape the site for all video links (this, I also don't know how to do yet but am using the project to learn).
My question is how do I pass my credentials to the individual pages of the website? For example if all I wanted to do was use python code to open a browser window pointing to http://play.horseandcountry.tv/live/ and have it open with me already logged in, how do I go about this?
As far as I know you have two options depending how you want to crawl and what you need to crawl:
1) Use urllib. You can do your POST request with the necessary login credentials. This is the low level solution, which means that this is fast, but doesn't handle high level stuff like javascript codes.
2) Use selenium. Whith that you can simulate a browser (Chrome, Firefox, other..), and run actions via your python code. Then it is much slower but works well with too "sophisticated" websites.
What I usually do: I try the first option and if a encounter a problem like a javascript security layer on the website, then go for option 2. Moreover, selenium can open a real web browser from your desktop and give you a visual of your scrapping.
In any case, just goolge "urllib/selenium login to website" and you'll find what you need.
If you want to avoid using Selenium (opening web browsers), you can go for requests, it can login the website and grab anything you need in the background.
Here is how you can login to that website with requests.
import requests
from bs4 import BeautifulSoup
#Login Form Data
payload = {
'account_email': 'your_email',
'account_password': 'your_passowrd',
'submit': 'Sign In'
}
with requests.Session() as s:
#Login to the website.
response = s.post('https://play.horseandcountry.tv/login/', data=payload)
#Check if logged in successfully
soup = BeautifulSoup(response.text, 'lxml')
logged_in = soup.find('p', attrs={'class': 'navbar-text pull-right'})
print s.cookies
print response.status_code
if logged_in.text.startswith('Logged in as'):
print 'Logged In Successfully!'
If you need explanations for this, you can check this answer, or requests documentation
You could also use the requests module. It is one of the most popular. Here are some questions that relate to what you would like to do.
Log in to website using Python Requests module
logging in to website using requests
I'm having a problem here with authorization on a site. I'm using a headless browser PhantomJS on top of Selenium.
What I do:
driver.get('https://foo.com')
# some instructions to log in
# foo.com makes some requests to https://bar.foo.com and sets cookies for https://bar.foo.com domain
driver.get_cookies()
# no cookies for https://bar.foo.com
# any requests to https://foo.com are not autorized because the cookies from https://bar.foo.com are the ones that are needed, but not provided
Is there any workaround for it? Been searching to fix the problem for a long time and no success.
One more note: for example, chromedriver collects all of the cookies received from AJAX requests to subdomains, while phantomjs doesn't do that.
At work, I sit behind a proxy. When I connect to the company WiFi and open up a browser, a pop-up box usually appears asking for my company credentials before it will let me navigate to any internal/external site.
I am using the Python Requests package to automate pulling data from an external site but am encountering a 401 error that is related to not having authenticated first. This happens when I don't authenticate first using the browser. If I authenticate first with the browser and then use Python requests then everything is fine and I'm able to navigate to any site.
My questions is how do I perform the work authentication part using Python? I want to be able to automate this process so that I can set a cron job that grabs data from an external source every night.
I've tried providing an blank URL:
import requests
response = requests.get('')
But requests.get() requires a properly structure URL. I want to be able to emulate as if I've opened up a browser and capturing the pop-up that asks for authentication. This does not rely on any URL being used.
From the requests documentation
import requests
proxies = {
"http": "http://10.10.1.10:3128",
"https": "http://10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)