I am getting the following 403 Forbidden response when trying to access a site from within my Python app. As far as I can tell, I am NOT being challenged by Cloudflare with a captcha, as is the case in a lot of other people’s similar questions; instead, it’s asking me to enable cookies. The website returns 200 OK if I try via curl or any browser, so it’s not an IP restriction; it’s just my Python request it doesn’t like. I have tried various User-Agent values to no avail, tried http, https and no scheme at all before the target URL, and I’ve mimicked exactly what the browser’s Network Inspector shows in the request headers of a successful regular browser GET.
Here’s the error in the http response: (status 403)
Please enable cookies.
Error 1020
Ray ID: 69c89e49895c40d7 • 2021-10-11 14:01:04 UTC
Access denied
What happened?
This website is using a security service to protect itself from online attacks.
Cloudflare Ray ID: 69c89e49895c40d7 • Your IP: x.x.x.x • Performance & security by Cloudflare
Please enable cookies.
Here’s my Python:
'''
import requests

r = requests.get(
    "https://www.oddschecker.com",  # requests needs the scheme; a bare host raises MissingSchema
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Firefox/68.0",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.5",
        "method": "GET",  # pseudo-header shown by the inspector, not a real request header
        "content-type": "text/plain",
        "accept-encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "scheme": "https",  # pseudo-header shown by the inspector, not a real request header
        "Upgrade-Insecure-Requests": "1",
        "Cache-Control": "max-age=0",
        "Host": "www.oddschecker.com",
        "TE": "Trailers",
    },
)
'''
Questions:
How does CloudFlare know that I need to enable cookies or that I’m not a regular browser just from my Python request? I send the request and get an immediate 403 back. The request is exactly the same as if I use a browser. It’s almost as though there’s some traffic going on that network inspector doesn’t show between my request and the 403. I used Fiddler too, and that just shows the same: GET request, immediate 403 response.
How DO I enable cookies within Python?
The Python Requests library has support for adding a cookie dictionary:
https://stackoverflow.com/a/7164897/13343799
You can see your cookies' key-value pairs in Chrome by pressing F12 to open Developer Tools -> Application tab -> Cookies, then selecting a cookie to see its Cookie Value below. The key is the part before the =, the value is the part after.
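As a minimal sketch of that approach: the Cookie request header shown in the browser's Network tab can be split into a dictionary and passed to Requests through its cookies argument. The helper names cookie_header_to_dict and fetch_with_cookies are made up for illustration, and the cookie string below is a placeholder, not a real cookie for any site.

```python
def cookie_header_to_dict(header):
    """Split a raw Cookie header ("a=1; b=2") into a {name: value} dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first "=" only, because cookie values may contain "=".
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def fetch_with_cookies(url, cookie_header, user_agent):
    """GET url, sending the browser's cookies and User-Agent along."""
    import requests  # third-party: python -m pip install requests

    return requests.get(
        url,
        cookies=cookie_header_to_dict(cookie_header),
        headers={"User-Agent": user_agent},
        timeout=10,
    )


# Placeholder cookie string; copy the real one from the browser's developer tools.
example = cookie_header_to_dict("session_id=abc123; theme=dark")
print(example)
```

Cookies copied this way only help until they expire, so this is a diagnostic step rather than a lasting fix.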
Related
I'm trying to scrape data from a dynamic website using Python requests.
I've had a look through the network requests in developer tools and found the URL the website sends GET requests to in order to access the required data:
When a request is made to the website's API it returns a cookie (via the Set-Cookie header) which I believe the browser then uses in future GET requests to access the data. Here is a screenshot of the request and response headers when the page is first loaded and all previous cookies have been removed:
When I load the request URL directly in my browser it works fine (it's able to acquire a valid cookie from the website and load the data). But when I send a GET request to that same URL via the Python requests module, the cookie returned doesn't seem to be working (I'm getting a 403 - Forbidden error).
My Python code:
import requests
session = requests.Session()
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-GB,en;q=0.9",
"Host": "www.oddschecker.com",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33",
}
response = session.get(url, headers=headers)
# Currently returning a 403 error unless I add the cookie from my browser as a header.
I believe the cookie is the issue because when I instead take the cookie generated by the browser and use that as a header in my Python program it is then able to return the desired information (until the cookie expires.)
My goal is for the Python program to be able to acquire a working cookie from this website automatically so it can successfully send requests and gather data.
I am trying to send an HTTP POST request to a website with these headers:
headers = {
"content-type": "application/x-www-form-urlencoded; charset=UTF-8",
"cookie": "__gpi=UID=00000625243f2b12:T=1654153135:RT=1654342443:S=ALNI_MbdFxSgua2dONohDTz9bEGks8vnoQ; __gads=ID=05dae5d77dbc463f:T=1654153135:S=ALNI_MbLIzKIHhP022gtr7bRBqu9PSxNtQ; PHPSESSID=8a932c5bbe4d667513dfdc3a0051ed37",
"origin": "https://www.dcode.fr",
"pragma": "no-cache",
"referer": "https://www.dcode.fr/cipher-identifier",
"sec-fetch-site": "same-origin",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36 OPR/87.0.4390.45",
"x-requested-with": "XMLHttpRequest"
}
At first it works perfectly.
But after some time it stops working. I think this is because the cookies expire.
Erroneous output:
{"captcha":"<script>$.getScript('https:\/\/www.google.com\/recaptcha\/api.js').done(function( script, textStatus ) {\n $('#captcha').addClass('g-recaptcha').attr({'data-sitekey':'6LeoCVQaAAAAALADLNorGItVJxP40YUjD1Q3S0zp','data-callback':'recaptcha_callback'});\n });\n<\/script>\n<div id='captcha'><\/div>"}
Expected output:
{"caption":"dCode's analyzer suggests to investigate:","results":{"<a href=\"\/rot-13-cipher\" target=\"_blank\">ROT-13 Cipher<\/a>":"\u25a0\u25a0","<a href=\"\/base-58-cipher\" target=\"_blank\">Base 58<\/a>":"\u25a0","<a href=\"\/playfair-cipher\" target=\"_blank\">PlayFair Cipher<\/a>":"\u25a0","<a href=\"\/base-64-encoding\" target=\"_blank\">Base64 Coding<\/a>":"\u25a0","<a href=\"\/substitution-cipher\" target=\"_blank\">Substitution Cipher<\/a>":"\u25aa","<a href=\"\/rot-cipher\" target=\"_blank\">ROT Cipher<\/a>":"\u25aa","<a href=\"\/caesar-cipher\" target=\"_blank\">Caesar Cipher<\/a>":"\u25aa","<a href=\"\/shift-cipher\" target=\"_blank\">Shift Cipher<\/a>":"\u25aa","<a href=\"\/hill-cipher\" target=\"_blank\">Hill Cipher<\/a>":"\u25aa","<a href=\"\/affine-cipher\" target=\"_blank\">Affine Cipher<\/a>":"\u25aa","<a href=\"\/keyboard-change-cipher\" target=\"_blank\">Keyboard Change Cipher<\/a>":"\u25ab","<a href=\"\/vigenere-cipher\" target=\"_blank\">Vigenere Cipher<\/a>":"\u25ab","<a href=\"\/homophonic-cipher\" target=\"_blank\">Homophonic Cipher<\/a>":"\u25ab","<a href=\"\/autoclave-cipher\" target=\"_blank\">Autoclave Cipher<\/a>":"\u25ab","<a href=\"\/beaufort-cipher\" target=\"_blank\">Beaufort Cipher<\/a>":"\u25ab","<a href=\"\/burrows-wheeler-transform\" target=\"_blank\">Burrows\u2013Wheeler Transform<\/a>":"\u25ab"}
If I copy the cookie from a request captured with the browser's developer tools and paste it into the code, it works again for a short time.
How can I bypass this reCAPTCHA error?
The website or API is running some kind of JavaScript authentication to block anything that is not a browser. To bypass this you have two options:
Either reverse-engineer the JavaScript to understand how the cookies are constructed and replicate them in Python (this is very hard and might take weeks of reverse engineering),
or create a Selenium instance that visits the site, waits for the cookies to be present, and then simply passes them to requests. You will have to do this each time the captcha is presented (this is the easier option, but it will make your script slower).
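The second option could look roughly like the sketch below. It assumes Firefox with geckodriver is installed, and that the cookie the site needs has a known name; cookie_name is a hypothetical parameter, since the source does not say which cookie matters. The function names are also made up for illustration.

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Reduce Selenium's cookie records to the {name: value} form Requests accepts."""
    return {c["name"]: c["value"] for c in selenium_cookies}


def session_from_browser(url, cookie_name, wait_seconds=30):
    """Open url in a real browser, wait for cookie_name, return a Requests session."""
    import requests  # third-party
    from selenium import webdriver  # third-party
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until the site's JavaScript has set the cookie we need.
        WebDriverWait(driver, wait_seconds).until(
            lambda d: d.get_cookie(cookie_name) is not None
        )
        cookies = selenium_cookies_to_dict(driver.get_cookies())
        user_agent = driver.execute_script("return navigator.userAgent;")
    finally:
        driver.quit()

    session = requests.Session()
    session.cookies.update(cookies)
    # Reuse the browser's User-Agent so both clients look the same to the server.
    session.headers["User-Agent"] = user_agent
    return session
```

The returned session can then be used for the actual POST or GET calls; when the captcha response comes back again, the browser step has to be repeated.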
This is not necessarily because the cookies have expired. Take a look at your output: it's a reCAPTCHA. You need to solve the captcha first.
In addition to that, make sure you are changing requests' default User-Agent.
Consider using requests.Session if you are not using it already, or alternatively Selenium if possible.
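To tell a captcha challenge apart from a normal reply in code, the response body can be inspected before it is used. This is only a sketch based on the two JSON shapes shown in the question (a "captcha" key in the erroneous output, a "results" key in the expected output); the function name classify_response is made up, and other response shapes are reported as unknown.

```python
import json


def classify_response(body):
    """Return "captcha", "ok", or "unknown" for a response body string."""
    try:
        data = json.loads(body)
    except ValueError:
        return "unknown"  # not JSON at all, e.g. an HTML error page
    if not isinstance(data, dict):
        return "unknown"
    if "captcha" in data:
        return "captcha"  # the server wants a reCAPTCHA solved first
    if "results" in data:
        return "ok"
    return "unknown"


print(classify_response('{"captcha": "<script>...</script>"}'))
print(classify_response('{"caption": "...", "results": {}}'))
```

A script can call this on response.text and refresh its cookies only when the result is "captcha".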
I'm writing some tests with Selenium and noticed that Referer is missing from the headers. I wrote the following minimal example to test this with https://httpbin.org/headers:
import selenium.webdriver
import selenium.webdriver.support.ui  # makes WebDriverWait reachable below
options = selenium.webdriver.FirefoxOptions()
options.add_argument('--headless')
profile = selenium.webdriver.FirefoxProfile()
profile.set_preference('devtools.jsonview.enabled', False)
driver = selenium.webdriver.Firefox(firefox_options=options, firefox_profile=profile)
wait = selenium.webdriver.support.ui.WebDriverWait(driver, 10)
driver.get('http://www.python.org')
assert 'Python' in driver.title
url = 'https://httpbin.org/headers'
driver.execute_script('window.location.href = "{}";'.format(url))
wait.until(lambda driver: driver.current_url == url)
print(driver.page_source)
driver.close()
Which prints:
<html><head><link rel="alternate stylesheet" type="text/css" href="resource://content-accessible/plaintext.css" title="Wrap Long Lines"></head><body><pre>{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "close",
"Host": "httpbin.org",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0"
}
}
</pre></body></html>
So there is no Referer. However, if I browse to any page and manually execute
window.location.href = "https://httpbin.org/headers"
in the Firefox console, Referer does appear as expected.
As pointed out in the comments below, when using
driver.get("javascript: window.location.href = '{}'".format(url))
instead of
driver.execute_script("window.location.href = '{}';".format(url))
the request does include Referer. Also, when using Chrome instead of Firefox, both methods include Referer.
So the main question still stands: Why is Referer missing in the request when sent with Firefox as described above?
Referer as per the MDN documentation
The Referer request header contains the address of the previous web page from which a link to the currently requested page was followed. The Referer header allows servers to identify where people are visiting them from and may use that data for analytics, logging, or optimized caching, for example.
Important: Although this header has many innocent uses it can have undesirable consequences for user security and privacy.
Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer
However:
A Referer header is not sent by browsers if:
The referring resource is a local "file" or "data" URI.
An unsecured HTTP request is used and the referring page was received with a secure protocol (HTTPS).
Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer
Privacy and security concerns
There are some privacy and security risks associated with the Referer HTTP header:
The Referer header contains the address of the previous web page from which a link to the currently requested page was followed, which can be further used for analytics, logging, or optimized caching.
Source: https://developer.mozilla.org/en-US/docs/Web/Security/Referer_header:_privacy_and_security_concerns#The_referrer_problem
Addressing the security concerns
From the Referer header's perspective, the majority of security risks can be mitigated with the following steps:
Referrer-Policy: Using the Referrer-Policy header on your server to control what information is sent through the Referer header. Again, a directive of no-referrer would omit the Referer header entirely.
The referrerpolicy attribute on HTML elements that are in danger of leaking such information (such as <img> and <a>). This can for example be set to no-referrer to stop the Referer header being sent altogether.
The rel attribute set to noreferrer on HTML elements that are in danger of leaking such information (such as <img> and <a>).
The Exit Page Redirect technique: The only method that should currently work without flaw is to have an exit page that you don’t mind appearing in the Referer header. Many websites implement this method, including Google and Facebook. If implemented correctly, the referrer data shows only the website the user came from instead of private information. Instead of the referrer data appearing as http://example.com/user/foobar, the new referrer data will appear as http://example.com/exit?url=http%3A%2F%2Fexample.com. The method works by having all external links on your website go to an intermediary page that then redirects to the final page. Below we have a link to the website example.com and we URL encode the full URL and add it to the url parameter of our exit page.
Sources:
https://developer.mozilla.org/en-US/docs/Web/Security/Referer_header:_privacy_and_security_concerns#How_can_we_fix_this
https://geekthis.net/post/hide-http-referer-headers/#exit-page-redirect
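The URL encoding step of the exit page redirect described above can be sketched with the standard library. The exit page address is the example one from the quoted text; the function name exit_link is made up for illustration.

```python
from urllib.parse import urlencode


def exit_link(target_url, exit_page="http://example.com/exit"):
    """Return a link to the exit page that carries target_url in its url parameter."""
    return exit_page + "?" + urlencode({"url": target_url})


print(exit_link("http://example.com"))
# http://example.com/exit?url=http%3A%2F%2Fexample.com
```

The exit page itself then reads the url parameter and redirects to it, so the destination only ever sees the exit page as the referrer.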
This use case
I have executed your code through both the GeckoDriver/Firefox and ChromeDriver/Chrome combinations:
Code Block:
driver.get('http://www.python.org')
assert 'Python' in driver.title
url = 'https://httpbin.org/headers'
driver.execute_script('window.location.href = "{}";'.format(url))
WebDriverWait(driver, 10).until(lambda driver: driver.current_url == url)
print(driver.page_source)
Observation:
Using GeckoDriver/Firefox, the Referer: "https://www.python.org/" header was missing, as follows:
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Host": "httpbin.org",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"
}
}
Using ChromeDriver/Chrome, the Referer: "https://www.python.org/" header was present, as follows:
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Host": "httpbin.org",
"Referer": "https://www.python.org/",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36"
}
}
Conclusion:
It seems to be an issue with GeckoDriver/Firefox in handling the Referer header.
Outro
Referrer Policy
I was trying to send http/https requests via proxy (socks5), but I can't understand if the problem is in my code or in the proxy.
I tried using this code and it gives me an error:
requests.exceptions.ConnectionError: SOCKSHTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.contrib.socks.SOCKSHTTPSConnection object at 0x000001B656AC9608>: Failed to establish a new connection: Connection closed unexpectedly'))
This is my code:
import requests
url = "https://www.google.com"
proxies = {
"http":"socks5://fsagsa:sacesf241_country-darwedafs_session-421dsafsa@x.xxx.xxx.xx:31112",
"https":"socks5://fsagsa:sacesf241_country-darwedafs_session-421dsafsa@x.xxx.xxx.xx:31112",
}
headers = {
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Sec-Gpc": "1",
"Sec-Fetch-Site": "same-origin",
"Sec-Fetch-User": "?1",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Dest": "document",
"Accept-Language": "en-GB,en;q=0.9"
}
r = requests.get(url, headers = headers, proxies = proxies)
print(r)
Then I checked the proxy with an online tool.
The tool manages to send requests through the proxy.
So the problem is in this code? I can't figure out what's wrong.
Edit (15/09/2021)
I added headers but the problem is still there.
Create a local server/mock to handle the request using pytest or some other testing framework with responses library to eliminate variables external to your application/script. I’m quite sure Google will reject requests with empty headers. Also, ensure you installed the correct dependencies to enable SOCKS proxy support in requests (python -m pip install requests[socks]). Furthermore, if you are making a remote request to connect to your proxy you must change socks5 to socks5h in your proxies dictionary.
References
pytest: https://docs.pytest.org/en/6.2.x/
responses: https://github.com/getsentry/responses
requests[socks]: https://docs.python-requests.org/en/master/user/advanced/#socks
In addition to basic HTTP proxies, Requests also supports proxies using the SOCKS protocol. This is an optional feature that requires that additional third-party libraries be installed before use.
You can get the dependencies for this feature from pip:
$ python -m pip install requests[socks]
Once you’ve installed those dependencies, using a SOCKS proxy is just as easy as using a HTTP one:
proxies = {
'http': 'socks5://user:pass@host:port',
'https': 'socks5://user:pass@host:port'
}
Using the scheme socks5 causes the DNS resolution to happen on the client, rather than on the proxy server. This is in line with curl, which uses the scheme to decide whether to do the DNS resolution on the client or proxy. If you want to resolve the domains on the proxy server, use socks5h as the scheme.
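A small sketch of building such a proxies dictionary, with a switch between socks5 (DNS on the client) and socks5h (DNS on the proxy). The function name socks_proxies and the host, port and credentials are placeholders, not values from the question.

```python
from urllib.parse import quote


def socks_proxies(user, password, host, port, remote_dns=True):
    """Build a Requests proxies dict for an authenticated SOCKS5 proxy."""
    scheme = "socks5h" if remote_dns else "socks5"
    # Percent-encode credentials so characters such as "@" or ":" cannot break the URL.
    proxy_url = "{}://{}:{}@{}:{}".format(
        scheme, quote(user, safe=""), quote(password, safe=""), host, port
    )
    return {"http": proxy_url, "https": proxy_url}


proxies = socks_proxies("user", "pass", "proxy.example.com", 1080)
print(proxies["https"])
# socks5h://user:pass@proxy.example.com:1080
```

The result is passed as requests.get(url, proxies=proxies, timeout=10), which needs the requests[socks] extra installed as described above.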
1 - The target DOMAIN is https://www.dnb.com/
This website is blocking access to it from many countries around the world including mine (Algeria).
So the known solution is clear (use a proxy), which I did.
2 - Configuring the system proxy in the network configuration, and connecting to the website via (Google Chrome) works, also using Firefox with the proxy settings works fine.
3 - I came to my code to start the job
import requests
# 1. Initialize the proxy
proxy = "xxx.xxx.xxx.xxx:3128"
# 2. Setting the Headers (I cloned Firefox request headers)
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Upgrade-Insecure-Requests": "1",
"Host": "www.dnb.com",
"DNT": "1"
}
# 3. URL
URL = "https://www.dnb.com/business-directory/company-profiles.bicicletas_monark_s-a.7ad1f8788ea84850ceef11444c425a52.html"
# 4. Make a get request.
r = requests.get(URL, headers=headers, proxies={"https": proxy})
# Nothing is returned and the program keeps running (like an infinite loop).
Note:
I know this keeps waiting because the default timeout is None, but I am sure the setup works, so the requests library should return a response. A timeout here would only help to assess the reliability of the proxy, for example.
So what is the cause of this? It is stuck (and so am I). I get the response and the correct HTML content with Firefox, Chrome and Postman using the same configuration.
I checked your code and ran it on my local machine. It seems the issue is with the proxy: I swapped in a public proxy and it works. You can confirm this by passing a timeout argument of a few seconds to requests.get. Also, if the code otherwise works properly (even if the response is a 403), it means the issue is with the proxy.
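A minimal sketch of that check, assuming an HTTP proxy: it adds a timeout and reports which kind of failure occurred instead of hanging. The function names with_scheme and check_proxy are made up for illustration. The proxy is given an explicit scheme because the Requests documentation asks for proxy URLs to include one.

```python
def with_scheme(proxy, default_scheme="http"):
    """Return proxy with an explicit scheme, e.g. "1.2.3.4:3128" -> "http://1.2.3.4:3128"."""
    return proxy if "://" in proxy else "{}://{}".format(default_scheme, proxy)


def check_proxy(url, proxy, timeout=10):
    """Try one GET through proxy and report what happened instead of hanging."""
    import requests  # third-party

    proxy_url = with_scheme(proxy)
    try:
        response = requests.get(
            url,
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=timeout,  # seconds; without it the call can wait forever
        )
    except requests.exceptions.ProxyError as exc:
        return "proxy refused or failed: {}".format(exc)
    except requests.exceptions.Timeout as exc:
        return "no answer within {} s: {}".format(timeout, exc)
    except requests.exceptions.ConnectionError as exc:
        return "connection failed: {}".format(exc)
    # An HTTP status, even 403, suggests the request made it through the proxy.
    return "HTTP {}".format(response.status_code)
```

Calling check_proxy(URL, proxy) with the values from the question should then either print a status code or name the failure, which helps separate proxy problems from problems in the code.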