I hit the wall trying to make request to https://1stkissmanga.io/ due to CloudFlare protection. I prepared header and cookie (which i read from Firefox) but still without success. What is weird, i can get this site properly with wget. This is the problem i don't understand - wget doesn't have any CloudFlare bypass mechanisms so if it works from wget then shouldn't it work also from Python requests?
Of course with wget i still need to give cookie value, otherwise wget will hit CloudFlare as well.
With wget (successful result):
wget "https://1stkissmanga.io/" -U "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0" --header="Cookie: __cf_bm=<some long string with dots and other special characters>"
With python:
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0",} cookies = {"__cf_bm": "<some long string with dots and other special characters>",}
url = "https://1stkissmanga.io/" res = requests.get(url, headers=headers, cookies=cookies)
I tried also to put cookie into header like
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0", "cookie": "__cf_bm=<some long string with dots and other special characters>",}
and do res = requests.get(url, headers=headers) but the result is the same. Whatever i do, request always stop on CloudFlare protection.
Not sure what to do next, CloudFlare proxy is out of question for now.
You should use string inside "Cookie" key, not dict. It should look like this {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0", "Cookie": "cf_clearance=<some hash here>; cf_chl_2=<some hash here>; cf_chl_prog=x11; XSRF-TOKEN=<some hash here>; laravel_session=<some hash here>; __cf_bm=<some hash here>;"}
The comlete code looks like this, but remember that check works for 10-15 minutes, after that you will need to take new cookie from browser.
import requests
h = {
"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0",
"Cookie": "cf_clearance=<some hash here>; cf_chl_2=<some hash here>; cf_chl_prog=x11; XSRF-TOKEN=<some hash here>; laravel_session=<some hash here>; __cf_bm=<some hash here>;"
}
requests.get(url, headers=h)
Related
Good morning,
Since yesterday, I'm having timeouts doing requests to ebay website. The code is simple:
import requests
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
htlm=requests.get("https://www.ebay.es",headers=headers).text
Tested with google and it works. This is the response I receive:
'\nGateway Timeout - In read \n\nGateway Timeout\nThe proxy server did not receive a timely response from the upstream server.\nReference #1.477f1602.1645295618.7675ccad\n\n'
What happened or changed? How could I solve it?
Removing the headers should work. Perhaps they don't like that user agent for some reason.
import requests
# headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
headers = {}
url = "https://www.ebay.es"
response = requests.get(url, headers=headers)
html_text = response.text
I am able to open this url via a browser and see the response in json format. However, when I use the requests module, there is no response from the method.
import requests
response = requests.get('https://api.nasdaq.com/api/calendar/earnings?date=2021-02-23')
What is wrong here?
this worked for me:
url = 'https://api.nasdaq.com/api/calendar/earnings?date=2021-02-23'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
response = requests.get(url, headers=headers)
Explanation
The site is blocking requests from python. Refer to explanation here
When adding the headers of the query that appear when inspecting the element in chrome, the request works well in python:
import requests
response = requests.get('https://api.nasdaq.com/api/calendar/earnings?date=2021-02-23',headers={"authority":"api.nasdaq.com","scheme":"https","path":"/api/calendar/earnings?date=2021-02-23","pragma":"no-cache","cache-control":"no-cache","accept":"application/json, text/plain, */*","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36","origin":"https://www.nasdaq.com","sec-fetch-site":"same-site","sec-fetch-mode":"cors","sec-fetch-dest":"empty","referer":"https://www.nasdaq.com/","accept-encoding":"gzip, deflate, br","accept-language":"en-US,en;q=0.9,es;q=0.8,nl;q=0.7"})
This is the reproducible code:
import requests
url = 'http://wjw.hubei.gov.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'}
res = requests.get(url,headers=headers)
print(res)
The code print(res) gives the following output:
<Response [412]>
I can open the webpage fine on my computer with Chrome.
Is there something missing in the header? Is there a way to get around the 412 error? Thanks in advance!
That website require a valid Cookie in order to response back to you.
I've tried several ways such as calling the main website and then retrieving the Cookie under requests.Session() but the website is not allowing me to pass through.
So the only way which you can use as for now. Or to use Selenium or pass a valid Cookie to the requests
Here's how to get the Cookie and User-Agent via the browser:
Using the following Code:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
"Cookie": "Hm_lvt_5544783ae3e1427d6972d9e77268f25d=1578572654; Hm_lpvt_5544783ae3e1427d6972d9e77268f25d=1578572671; dataHide2=64fa0f2a-a6aa-43b4-adf0-ce901e8d1a37; FSSBBIl1UgzbN7N80S=sXE0qXcyGkTm4uVerLqfZyUU3XFMZzkm22k.eqVABLPe0eYMo3D8uX5ZJ07.7cCr; FSSBBIl1UgzbN7N80T=4aY.P74ZFvDef6i1BgsPAGpjsGOCcIHJFaOyshl4_fJ1WvTk1nqBkdG9PsyX3VRZcIuI8zdYiRJw4rEBQfx.Mv.GS_wT6Hzgiw.AY.UMP.Mw4iCKXGDzY1UeIH2gUd15impxzBVzZpN3MnSdqD0TUqcxSq0RrvIuE8RKT5pFLAqaNnVqtbeSACx43yIYtKJ41y8Isu6a6lNOlWNeaFJ8bx22pKm3lAIO.HIDhGSZqrUP76.q3i4Iux59f7dqJPuSRF90G1LSUBE8t8HrlWzBcSwJJJARX4Ioc0iHmHvdkVoigUitTRjLUHJM4ieOV1sLBDsq"
}
r = requests.get("http://wjw.hubei.gov.cn/", headers=headers)
print(r)
Output:
<Response [200]>
Update:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"}
with requests.Session() as req:
r = req.get("http://www.hubei.gov.cn/")
headers['Cookie'] = r.headers.get("Set-Cookie")
for item in range(10):
new = req.get("http://wjw.hubei.gov.cn/", headers=headers)
print(new)
import requests
response=requests.get("https://precog.iiit.ac.in/")
< Response [200] >
<Response [400]>
<Response [800]>
None of the above responses
I use Python requests to get images, but in some case sit doesn't work. It seems to happen more often. An example is
http://recipes.thetasteofaussie.netdna-cdn.com/wp-content/uploads/2015/07/Leek-and-Sweet-Potato-Gratin.jpg
It loads fine in my browser, but using requests, it returns html that says "403 forbidden" and "nginx/1.7.11"
import requests
image_url = "<the_url>"
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36', 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8','Accept-Encoding':'gzip,deflate,sdch'}
r = requests.get(image_url, headers=headers)
# r.content is html '403 forbidden', not an image
I have also tried with this header, which has been necessary in some cases. Same result.
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36', 'Accept':'image/webp,*/*;q=0.8','Accept-Encoding':'gzip,deflate,sdch'}
(I had a similar question a few weeks ago, but this was answered by the particular image file types not being supported by PIL. This is different.)
EDIT: Based on comments:
It seems the link only works if you have already visited the original site http://aussietaste.recipes/vegetables/leek-vegetables/leek-and-sweet-potato-gratin/ with the image. I suppose the browser then uses the cached version. Any workaround?
The site is validating the Referer header. This prevents other sites from including the image in their web pages and using the image host's bandwidth. Set it to the site you mentioned in your post, and it will work.
More info:
https://en.wikipedia.org/wiki/HTTP_referer
import requests
image_url = "http://recipes.thetasteofaussie.netdna-cdn.com/wp-content/uploads/2015/07/Leek-and-Sweet-Potato-Gratin.jpg"
headers = {
'User-agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding' : 'gzip,deflate,sdch',
'Referer' : 'http://aussietaste.recipes/vegetables/leek-vegetables/leek-and-sweet-potato-gratin/'
}
r = requests.get(image_url, headers=headers)
print r
For me, this prints
<Response [200]>
Is there a way to copy the network data from Firebug (for example POST headers) and put them into Python code so I don't need to write each header by myself?
There is an option Copy Request Headers, but it is not in the right format for Python.
So the thing I want is not to obtain this:
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
because I have to change the format to dictionary or something else, but this:
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0"
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
It is not necessary to get it in Python's dictionary format. The only thing I want is to automatically use this data in Python.
Post-process the headers you've copied from Firefox: split each line of the input string by : and make a dictionary, example:
In [1]: headers = """
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
"""
In [2]: dict(item.split(": ", 1) for item in headers.splitlines() if item)
Out[2]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0'}