Python requests not returning a response - python

1 - The target domain is https://www.dnb.com/
This website blocks access from many countries around the world, including mine (Algeria).
So the obvious solution is to use a proxy, which I did.
2 - Configuring the system proxy in the network settings and connecting to the website via Google Chrome works; using Firefox with the same proxy settings also works fine.
3 - So I wrote my code to start the job:
import requests

# 1. Initialize the proxy
proxy = "xxx.xxx.xxx.xxx:3128"

# 2. Set the headers (cloned from a Firefox request)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Upgrade-Insecure-Requests": "1",
    "Host": "www.dnb.com",
    "DNT": "1"
}

# 3. URL
URL = "https://www.dnb.com/business-directory/company-profiles.bicicletas_monark_s-a.7ad1f8788ea84850ceef11444c425a52.html"

# 4. Make a GET request.
r = requests.get(URL, headers=headers, proxies={"https": proxy})
# Nothing is returned and the program keeps running (like an infinite loop).
Note:
I know it keeps waiting because the default timeout is None, but the setup is known to work, so the requests library should still return a response; setting a timeout here would only be useful for assessing the reliability of the proxy, for example.
So, what is the cause of this? The request is stuck (and so am I), while Firefox, Chrome and Postman all return the response and the correct HTML content with the same configuration.

I checked your code and ran it on my local machine. It seems the issue is with the proxy: I added a public proxy and it worked. You can confirm this by passing a timeout argument of a few seconds to requests.get. If the code then completes (even with a 403 response), the problem is with your proxy.
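For example, something along these lines (the proxy address is a placeholder and the timeout value is arbitrary):

import requests

URL = "https://www.dnb.com/"       # any page on the target domain
proxy = "xxx.xxx.xxx.xxx:3128"     # placeholder proxy address

try:
    # A short timeout turns a dead proxy into an explicit error
    # instead of an indefinite hang.
    r = requests.get(URL, proxies={"https": proxy}, timeout=10)
    print(r.status_code)           # even a 403 proves the proxy answered
except requests.exceptions.ProxyError as e:
    print("Proxy refused or failed the connection:", e)
except requests.exceptions.ConnectTimeout:
    print("Could not reach the proxy/target within the timeout.")
except requests.exceptions.ReadTimeout:
    print("Connected, but no response arrived within the timeout.")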

Related

Retrieving Powershell Proxy Config

In PowerShell, I am currently performing this request, copied from the Network tab in developer tools:
$session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
$session.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
Invoke-WebRequest -UseBasicParsing -Uri "https://......?...." `
    -WebSession $session `
    -Headers @{
        "Accept"="*/*"
        "Accept-Encoding"="gzip, deflate, br"
        "Accept-Language"="en-US,en;q=0.9"
        "Authorization"="Basic mzYw....="
        "Referer"="https://......."
        "sec-Fetch-Dest"="empty"
        "sec-Fetch-Mode"="cors"
        "sec-Fetch-Site"="same-origin"
        "sec-ch-ua"="`"Chromium`";v=`"106`", `"Google Chrome`";v=`"106`", `"NotlA=Brand`";v=`"99`""
        "sec-ch-ua-mobile"="?0"
        "sec-ch-ua-platform"="`"Windows`""
    } `
    -ContentType "application/x-www-form-urlencoded"
which returns the response 200 just fine.
However, when I try to perform the same request, with the same header configuration, using Python requests, I get an SSL proxy-related error (see SSL_verification wrong version number even with certifi verify).
Is a proxy automatically configured for PowerShell requests? How can I find out which proxy my requests are currently routed through? Otherwise, how can I replicate the PowerShell request 1:1 in Python requests?
I have tried running the ipconfig /all command and using the Primary Dns Suffix field as the proxy argument in requests:
requests.get(url, headers=headers_in_powershell, proxies={'http': 'the_dns_suffix', 'https': 'the_dns_suffix'})
but the request just gets stuck (waits indefinitely with no response).
For most commands, PowerShell uses the system proxy by default (or they have a -Proxy switch to tell them where it is), but some don't and have to be told to use it.
From memory, Invoke-WebRequest can be problematic as (I think) it uses the .NET web client.
Try adding this to the start of the PS script:
[System.Net.WebRequest]::DefaultWebProxy = [System.Net.WebRequest]::GetSystemWebProxy()
[System.Net.WebRequest]::DefaultWebProxy.Credentials = [System.Net.CredentialCache]::DefaultNetworkCredentials
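On the Python side, one way to check which proxy the system is advertising (if any) is urllib's proxy detection; this is only a rough sketch, and the returned dictionary can be empty if the proxy is configured via a PAC/WPAD script rather than environment variables or the Windows registry:

import urllib.request
import requests

# Read proxy settings from the environment (and, on Windows, the registry),
# e.g. {'http': 'http://proxy.local:3128', 'https': 'http://proxy.local:3128'}
system_proxies = urllib.request.getproxies()
print(system_proxies)

# Pass the discovered proxies explicitly to requests
# (requests also honours HTTP_PROXY / HTTPS_PROXY environment variables by default).
r = requests.get("https://example.com", proxies=system_proxies, timeout=10)
print(r.status_code)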

Can access website's API via browser but not python requests (cookie not working)

I'm trying to scrape data from a dynamic website using Python requests.
I've had a look through the network requests in developer tools and found the URL the website sends GET requests to in order to access the required data:
When a request is made to the website's API it returns a cookie (via the Set-Cookie header) which I believe the browser then uses in future GET requests to access the data. Here is a screenshot of the request and response headers when the page is first loaded and all previous cookies have been removed:
When I load the request URL directly in my browser it works fine (it's able to acquire a valid cookie from the website and load the data). But when I send a GET request to that same URL via the Python requests module, the cookie returned doesn't seem to be working (I'm getting a 403 - Forbidden error).
My Python code:
import requests

session = requests.Session()

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-GB,en;q=0.9",
    "Host": "www.oddschecker.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33",
}

response = session.get(url, headers=headers)
# Currently returning a 403 error unless I add the cookie from my browser as a header.
I believe the cookie is the issue, because when I instead take the cookie generated by the browser and use it as a header in my Python program, it is then able to return the desired information (until the cookie expires).
My goal is for the Python program to be able to acquire a working cookie from this website automatically so it can successfully send requests and gather data.
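For reference, the flow I'm trying to reproduce looks roughly like the sketch below: the first request should receive the Set-Cookie header, and the Session should then send that cookie back automatically on the data request (the data-endpoint URL is a placeholder, since the real one only appears in the screenshots above):

import requests

session = requests.Session()

page_url = "https://www.oddschecker.com/"                 # page that sets the cookie
api_url = "https://www.oddschecker.com/<data-endpoint>"   # placeholder for the real API URL

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33",
    "Accept-Language": "en-GB,en;q=0.9",
}

# First request: any Set-Cookie response header is stored in session.cookies.
first = session.get(page_url, headers=headers, timeout=10)
print(first.status_code, session.cookies.get_dict())

# Second request: the stored cookie is sent back automatically.
data = session.get(api_url, headers=headers, timeout=10)
print(data.status_code)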

Auto-generating cookies when sending a POST request

I am trying to send an HTTP POST request to a website with these headers:
headers = {
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "cookie": "__gpi=UID=00000625243f2b12:T=1654153135:RT=1654342443:S=ALNI_MbdFxSgua2dONohDTz9bEGks8vnoQ; __gads=ID=05dae5d77dbc463f:T=1654153135:S=ALNI_MbLIzKIHhP022gtr7bRBqu9PSxNtQ; PHPSESSID=8a932c5bbe4d667513dfdc3a0051ed37",
    "origin": "https://www.dcode.fr",
    "pragma": "no-cache",
    "referer": "https://www.dcode.fr/cipher-identifier",
    "sec-fetch-site": "same-origin",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36 OPR/87.0.4390.45",
    "x-requested-with": "XMLHttpRequest"
}
At first it works perfectly.
But after some time it stops working; I think this is because the cookies expire.
Erroneous output:
{"captcha":"<script>$.getScript('https:\/\/www.google.com\/recaptcha\/api.js').done(function( script, textStatus ) {\n $('#captcha').addClass('g-recaptcha').attr({'data-sitekey':'6LeoCVQaAAAAALADLNorGItVJxP40YUjD1Q3S0zp','data-callback':'recaptcha_callback'});\n });\n<\/script>\n<div id='captcha'><\/div>"}
Expected output :
{"caption":"dCode's analyzer suggests to investigate:","results":{"<a href=\"\/rot-13-cipher\" target=\"_blank\">ROT-13 Cipher<\/a>":"\u25a0\u25a0","<a href=\"\/base-58-cipher\" target=\"_blank\">Base 58<\/a>":"\u25a0","<a href=\"\/playfair-cipher\" target=\"_blank\">PlayFair Cipher<\/a>":"\u25a0","<a href=\"\/base-64-encoding\" target=\"_blank\">Base64 Coding<\/a>":"\u25a0","<a href=\"\/substitution-cipher\" target=\"_blank\">Substitution Cipher<\/a>":"\u25aa","<a href=\"\/rot-cipher\" target=\"_blank\">ROT Cipher<\/a>":"\u25aa","<a href=\"\/caesar-cipher\" target=\"_blank\">Caesar Cipher<\/a>":"\u25aa","<a href=\"\/shift-cipher\" target=\"_blank\">Shift Cipher<\/a>":"\u25aa","<a href=\"\/hill-cipher\" target=\"_blank\">Hill Cipher<\/a>":"\u25aa","<a href=\"\/affine-cipher\" target=\"_blank\">Affine Cipher<\/a>":"\u25aa","<a href=\"\/keyboard-change-cipher\" target=\"_blank\">Keyboard Change Cipher<\/a>":"\u25ab","<a href=\"\/vigenere-cipher\" target=\"_blank\">Vigenere Cipher<\/a>":"\u25ab","<a href=\"\/homophonic-cipher\" target=\"_blank\">Homophonic Cipher<\/a>":"\u25ab","<a href=\"\/autoclave-cipher\" target=\"_blank\">Autoclave Cipher<\/a>":"\u25ab","<a href=\"\/beaufort-cipher\" target=\"_blank\">Beaufort Cipher<\/a>":"\u25ab","<a href=\"\/burrows-wheeler-transform\" target=\"_blank\">Burrows\u2013Wheeler Transform<\/a>":"\u25ab"}
If I copy the cookie from a request captured in the browser's developer tools and paste it into the code, it works again for a short amount of time.
How can I bypass this reCAPTCHA error?
The website or API is running some kind of JavaScript check to block anything that is not a browser. To get around this you have two options:
either reverse-engineer the JavaScript, understand how the cookies are constructed, and replicate them in Python (this is very hard and might take weeks of reverse engineering),
or create a Selenium instance that visits the site, waits for the cookies to be present, and then simply passes them to requests; you will have to do this each time the captcha is presented (this is the easier option, but it will make your script slower; see the sketch below).
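This is only an illustration of that second option (assuming Selenium and a matching Chrome driver are installed); the wait strategy, endpoint and form fields are placeholders, not the site's real API:

import time
import requests
from selenium import webdriver

PAGE_URL = "https://www.dcode.fr/cipher-identifier"  # page whose JavaScript sets the cookies
API_URL = "https://www.dcode.fr/..."                 # placeholder for the POST endpoint from devtools

# Open a real browser so the site's JavaScript can set its cookies.
driver = webdriver.Chrome()
driver.get(PAGE_URL)
time.sleep(10)  # crude wait for the cookies to appear; an explicit wait on a known element is more robust

# Copy the browser's cookies into a requests session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))
driver.quit()

# Later requests reuse the browser-generated cookies.
r = session.post(API_URL,
                 data={},  # the form fields captured in devtools go here
                 headers={"x-requested-with": "XMLHttpRequest"})
print(r.status_code)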
This is not necessarily because the cookies have expired; take a look at your output: it's a reCAPTCHA. You need to solve the captcha first.
In addition to that, make sure you change requests' default User-Agent.
Consider using requests.Session if you are not using it already, or alternatively Selenium if possible.

Python Requests 403 Enable Cookies

I am getting the following 403 Forbidden response when trying to access a site from within my Python app. As far as I can tell I am NOT being challenged by Cloudflare with any captcha, as is the case in a lot of other people's similar questions; it's asking me to enable cookies. The website returns 200 OK if I try via curl or via any browser, so it's not an IP restriction, it's just my Python request it doesn't like. I have tried various combinations of User-Agent to no avail, tried http, https and nothing at all before the target URL, and I've mimicked exactly what the browser's Network Inspector shows in the request headers from a successful regular browser GET.
Here’s the error in the http response: (status 403)
Please enable cookies.
Error 1020
Ray ID: 69c89e49895c40d7 • 2021-10-11 14:01:04 UTC
Access denied
What happened?
This website is using a security service to protect itself from online attacks.
Cloudflare Ray ID: 69c89e49895c40d7 • Your IP: x.x.x.x • Performance & security by Cloudflare
Please enable cookies.
Here's my Python:
r = requests.get("https://www.oddschecker.com",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Firefox/68.0",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-GB,en;q=0.5",
        "method": "GET",        # copied from the browser inspector; not a real header
        "content-type": "text/plain",
        "accept-encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "scheme": "https",      # copied from the browser inspector; not a real header
        "Upgrade-Insecure-Requests": "1",
        "Cache-Control": "max-age=0",
        "Host": "www.oddschecker.com",
        "TE": "Trailers"
    })
Questions:
How does Cloudflare know that I need to enable cookies, or that I'm not a regular browser, just from my Python request? I send the request and get an immediate 403 back. The request is exactly the same as the one a browser sends. It's almost as though there's some traffic between my request and the 403 that the Network Inspector doesn't show. I used Fiddler too, and it just shows the same: GET request, immediate 403 response.
How DO I enable cookies within Python?
The Python requests library has support for passing a cookie dictionary:
https://stackoverflow.com/a/7164897/13343799
You can see your cookies' key-value pairs (in Chrome) by pressing F12 to open Developer Tools -> Application tab -> Cookies -> selecting a cookie and viewing its value below it. The key is before the =, the value is after it.
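A minimal sketch of what that looks like; the cookie names and values here are placeholders to be copied from your own browser (Cloudflare clearance cookies expire, so this is a short-lived workaround rather than a permanent fix):

import requests

# Cookie name/value pairs copied from the browser's developer tools (placeholders).
cookies = {
    "cf_clearance": "<value copied from the browser>",
    "__cf_bm": "<value copied from the browser>",
}

r = requests.get("https://www.oddschecker.com",
                 headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Firefox/68.0"},
                 cookies=cookies,
                 timeout=10)
print(r.status_code)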

get auth code from URL response - Python

I'm using the OneDrive Python SDK to handle authentication with OneDrive. The authentication is done as:
import onedrivesdk
from onedrivesdk.helpers import GetAuthCodeServer

redirect_uri = "http://localhost:8080/"
client_secret = "your_app_secret"

client = onedrivesdk.get_default_client(client_id='your_client_id',
                                         scopes=['wl.signin',
                                                 'wl.offline_access',
                                                 'onedrive.readwrite'])

auth_url = client.auth_provider.get_auth_url(redirect_uri)

# this will block until we have the code
code = GetAuthCodeServer.get_auth_code(auth_url, redirect_uri)

client.auth_provider.authenticate(code, redirect_uri, client_secret)
However, since I run this authentication on an EC2 instance, and moreover I don't want to launch a browser just for this, the code blocks indefinitely. Here's get_auth_code from Microsoft:
def get_auth_code(auth_url, redirect_uri):
    """Easy way to get the auth code. Wraps up all the threading
    and stuff. Does block main thread.

    Args:
        auth_url (str): URL of auth server
        redirect_uri (str): Redirect URI, as set for the app. Should be
            something like "http://localhost:8080" for this to work.

    Returns:
        str: A string representing the auth code, sent back by the server
    """
    HOST, PORT = urlparse(redirect_uri).netloc.split(':')
    PORT = int(PORT)

    # Set up HTTP server and thread
    code_acquired = threading.Event()
    s = GetAuthCodeServer((HOST, PORT), code_acquired, GetAuthCodeRequestHandler)
    th = threading.Thread(target=s.serve_forever)
    th.start()
    webbrowser.open(auth_url)
    # At this point the browser will open and the code
    # will be extracted by the server
    code_acquired.wait()  # First wait for the response from the auth server
    code = s.auth_code
    s.shutdown()
    th.join()
    return code
I want to return the code. Here's a sample of auth_url:
https://login.live.com/oauth20_authorize.srf?scope=wl.offline_access+onedrive.readwrite&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&response_type=code&client_id='your_client_id'
When I enter that URL in the browser, I get the code back:
http://localhost:8080/?code=Mb0bba7d1-adbc-9c1d-f790-3709cd0b9f16
So I want to avoid that cumbersome process and get the code back using requests. How can I accomplish that?
I know this is an old question, but I was struggling with the same problem: I wanted to get the code using the requests library. I managed to do it somehow, but I doubt this is a very sustainable solution. Hopefully, after reading my solution you will understand better how the authentication works, and you might find an improved solution.
I have a Python Flask app with a MySQL database. Occasionally I want to create a backup of the database and send the backup file to my OneDrive, and I want to trigger this process from inside my Flask app.
First, I registered my app at the Microsoft Application Registration Portal and added a new Web platform with the redirect URL http://localhost:8080/signin-microsoft. I gave the app read and write permissions and stored the Application Id (client_id) and Application Secret (client_secret).
Second, I added a new route to my Flask app. Note that my Flask app is running on localhost:8080.
@app.route("/signin-microsoft", methods=['GET'])
def get_code():
    return 'Yadda'
Third, I replicated the HTTP request headers created by my browser in my requests.get call. That is, I opened Chrome, pasted auth_url into the address bar, hit Enter, inspected the request headers and copied their content into my code.
r = requests.get(auth_url,
                 headers={"Host": "login.live.com",
                          "Connection": "keep-alive",
                          "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                          "Accept-Encoding": "gzip, deflate, br",
                          "Upgrade-Insecure-Requests": "1",
                          "Accept-Language": "fi-FI,fi;q=0.9,en-US;q=0.8,en;q=0.7",
                          "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
                          "Cookie": (SUPER LONG WONT PASTE HERE)})
Fourth, I parsed the code from the URL the request was redirected to.
re_url = r.url
code = re_url.split('code=')[-1]
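As a side note, a slightly more robust way to pull the code out of the redirected URL than splitting on 'code=' could be urllib's query parsing, sketched here:

from urllib.parse import urlparse, parse_qs

re_url = r.url
query = parse_qs(urlparse(re_url).query)
code = query["code"][0]  # raises KeyError if the redirect did not include a code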
Here is the final code:
redirect_uri = 'http://localhost:8080/signin-microsoft'
client_secret = CLIENT_SECRET
client_id = CLIENT_ID
api_base_url = 'https://api.onedrive.com/v1.0/'
scopes = ['wl.signin', 'wl.offline_access', 'onedrive.readwrite']

http_provider = onedrivesdk.HttpProvider()
auth_provider = onedrivesdk.AuthProvider(
    http_provider=http_provider, client_id=client_id, scopes=scopes)
client = onedrivesdk.OneDriveClient(api_base_url, auth_provider, http_provider)

auth_url = client.auth_provider.get_auth_url(redirect_uri)

r = requests.get(auth_url,
                 headers={"Host": "login.live.com",
                          "Connection": "keep-alive",
                          "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                          "Accept-Encoding": "gzip, deflate, br",
                          "Upgrade-Insecure-Requests": "1",
                          "Accept-Language": "fi-FI,fi;q=0.9,en-US;q=0.8,en;q=0.7",
                          "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
                          "Cookie": (SUPER LOOONG)})

re_url = r.url
code = re_url.split('code=')[-1]

client.auth_provider.authenticate(code, redirect_uri, client_secret)
I think there are two main points here: you need an HTTP server that listens on the redirect URI (in Microsoft's example they used HTTPServer from http.server), and you need to get the request headers right. Without the headers, the request won't redirect correctly and you won't get the code!
