I'm trying to scrape a website. Before writing any code, I copied the request from the browser's network tool as a curl command and pasted it into Insomnia (a tool like Postman). I could resend the same request in Insomnia and get a 200 response.
However, when I mimic the same request in a scrapy Request with the same body, URL and headers, Scrapy receives 403 responses. On the other hand, if I generate Python code for the requests library through Insomnia, the identical request works with requests.
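Roughly, the Scrapy side looks like this (the URL, body and header values here are placeholders, not the real ones):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Same URL, body and headers that get a 200 in Insomnia
        yield scrapy.Request(
            url="https://example.com/some-endpoint",
            method="POST",
            body='{"key": "value"}',
            headers={
                "User-Agent": "Mozilla/5.0 ...",
                "Content-Type": "application/json",
            },
            callback=self.parse,
        )

    def parse(self, response):
        print(response.status)  # 403 here, but 200 in Insomnia/requests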
So how can two identical requests, one sent with Scrapy and one with Requests, get different responses, i.e. the one Scrapy sends fails whereas the one Requests sends succeeds?
Thank you
I was making a POST request using the requests module in Python. In the key/value pairs of response.json() (the response from the POST request), one value contains multiple lines. I can see the entire value when making the same POST request with curl, but with the requests module I cannot see the whole multi-line value.
How can I get the same response as from curl?
This is the response from the curl POST request:
{"output":"b171_L19# show cmd status \n SSH and sudo pwdless\t:Enable\nAutostrt\t\t\t:Enabled\nCmd status\t\t\t:Running\ncmd version\t\t\t:4.1.2\n additional status\t\t:normal\n"}
This is the response from the requests module in Python:
{"output":"b171_L19# \n"}
Hi! I am new here, so let me describe my issue clearly; please ignore my mistakes.
I am making a request to a page that relies heavily on JavaScript.
Actually, it is the Paytm payment response page for a UPI transaction.
Whenever I make the request, the response is {'POLL_STATUS':"STOP_POLLING"}.
The problem is that the request gives this response while the browser gives a different response with the loaded HTML.
I tried everything, like stopping redirects and printing the raw content; nothing works.
I thought a urllib POST request might work, but I don't know how to use it.
Can anyone please tell me how to get the exact HTML response that the browser gives?
Note[0]: Please don't suggest Selenium, because this issue occurs in the middle of my script.
Note[1]: A friendly answer is appreciated.
# Poll the transaction status endpoint up to 15 times
for i in range(0, 15):
    resp_check_transaction = self.s.post(
        "https://secure.website.in/theia/upi/transactionStatus"
        "?MID=" + str(Merchant_ID) + "&ORDER_ID=" + str(ORDER_ID),
        headers=check_transaction(str(ORDER_ID)),
        data=check_transaction_payload(Merchant_ID, ORDER_ID, TRANSID, CASHIERID),
    )
    print(resp_check_transaction.text)
    resp_check_transaction = resp_check_transaction.json()
    if resp_check_transaction['POLL_STATUS'] == "STOP_POLLING":
        print("Breaking loop")
        break
    time.sleep(4)

self.clear_header()
parrms = {
    "MID": str(Merchant_ID),
    "ORDER_ID": str(ORDER_ID),
}
resp_transaction_pass = requests.post(
    "https://secure.website.in/theia/upi/transactionStatus",
    headers=transaction_pass(str(ORDER_ID)),
    data=transaction_pass_payload(CASHIERID, UPISTATUSURL, Merchant_ID,
                                  ORDER_ID, TRANSID, TXN_AMOUNT),
    params=parrms,
    allow_redirects=True,
)
print("Printing response")
print(resp_transaction_pass.text)
print(resp_transaction_pass.content)
And in the web browser, the bank's response shows Status Code: 302 Moved Temporarily. :(
About the 302 status code
You mention that the web browser receives a 302 status code in response to the request. In the simplest terms, the 302 status code is just the web server's way of saying "Hey, I know what you're looking for, but it is actually located at this other URL.".
Basically all modern browsers and HTTP request libraries like Python's Requests will automatically follow a 302 redirect and act as though you sent the request to the new URL instead. (Your browser's developer tools may show that a 302 redirect happened, but as far as the JavaScript is concerned it just got a normal 200 response.)
If you really want to see whether your Python script receives a 302 status, you can do so by setting the allow_redirects option to False, but this means you will have to fetch the content from the new URL manually.
import requests
r1 = requests.get('https://httpstat.us/302', allow_redirects=False)
r2 = requests.get('https://httpstat.us/302', allow_redirects=True)
print('No redirects:', r1.status_code) # 302
print('Redirects on:', r2.status_code) # 200 (status code of page it redirects to)
Note that allow_redirects is already set to True by default; I just wanted to make the example a bit more verbose so the difference is obvious.
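If you do disable redirects, following one manually just means reading the Location header yourself, roughly like this:

import requests
from urllib.parse import urljoin

r = requests.get('https://httpstat.us/302', allow_redirects=False)
if r.is_redirect:
    # The redirect target is in the Location header (it may be relative)
    new_url = urljoin(r.url, r.headers['Location'])
    r2 = requests.get(new_url)
    print(r2.status_code)  # 200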
So why is the response content different?
So even though the browser and the Requests library are both automatically following the 302 redirect, the response they get is still different. You didn't share any screenshots of the browser's requests or responses, so I can only give a few educated guesses, but it boils down to the fact that the request made by your Python code is somehow different from the one made by the JavaScript loaded in the web browser.
Some things to consider:
Are you sure you are using the correct HTTP method? Is the browser also making a POST request?
If so, are you sure the body of the request is the same and in the same format as the one sent by the web browser?
Perhaps the browser has a session cookie it is sending along with the request (note that this is usually not explicitly done in the JS but happens automatically).
Alternatively, the JS might include an API key or credentials in the HTTP auth header (this should be explicitly visible in the JS).
Although unlikely, it could be that whatever API you're trying to query is trying to block reverse-engineering attempts by blocking the Requests library's user agent string.
Luckily, all of these differences can easily be examined with some print statements and your browser's developer tools :p.
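For example, you can print exactly what Requests sent by inspecting the PreparedRequest attached to the response (the URL below is just a placeholder), and compare it field by field with what the developer tools show:

import requests

resp = requests.post('https://example.com/api', data={'key': 'value'})

sent = resp.request     # the PreparedRequest that actually went over the wire
print(sent.method)      # e.g. POST
print(sent.url)         # final URL, including query parameters
print(sent.headers)     # note the default 'python-requests/x.y.z' User-Agent
print(sent.body)        # the encoded request body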
Making GET requests to URLs can be done with the Python requests library like this:
import requests
response = requests.get("https://someapi.io/api/data")
response.json()
I'm currently trying to make GET requests to URLs with the Flask request module, but I can't find any useful information.
Flask is a server; the request object you can access as flask.request is the context of the incoming request your API received.
Please see: https://flask.palletsprojects.com/en/1.1.x/
The requests library is not related to Flask, and with it you can make all the GET requests you want.
You can make an outgoing request inside one of the API handlers served by your Flask app, but for that you have to use the requests module directly, not the request object of the flask module.
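A minimal sketch of the difference (the route and upstream URL here are made up):

import requests                              # HTTP client: makes outgoing requests
from flask import Flask, request, jsonify    # flask.request: the incoming request context

app = Flask(__name__)

@app.route("/proxy-data")
def proxy_data():
    # flask.request describes the request this server received...
    client_agent = request.headers.get("User-Agent")
    # ...while the requests library makes a new outgoing GET request
    upstream = requests.get("https://someapi.io/api/data")
    return jsonify({"caller": client_agent, "data": upstream.json()})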
How can I scrape the list of products from this page with scrapy?
I have tried the ajax request url the browser sends:
https://www.amazon.cn/gp/profile/A34PAP6LGJIN6N/more?next_batch_params%5Breview_offset%5D=10&_=1469081762384
but it returns 404.
You need to replicate the headers you see in the request.
If you inspect the request in your browser's developer tools, you can see which headers it sends. From these you need to update your scrapy.Request.headers attribute with a few of the values. For the most part you can skip the Cookie header, since Scrapy manages cookies by itself, and for AJAX requests like this it's usually meaningless anyway.
For this case I've managed to get a successful response by replicating only the X-Requested-With header. This header is used to indicate that an AJAX request is happening.
You can actually test this out and engineer it in real time:
scrapy shell <url>
# gives you 403
request.headers.update({'X-Requested-With': 'XMLHttpRequest'})
request.headers.update({'User-Agent': <some user agent>})
fetch(request)
# now the request is redownloaded and it's 200!
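Once you know which headers are needed, you can set them on the request in your spider; for example (the spider boilerplate here is just illustrative):

import scrapy

class ReviewsSpider(scrapy.Spider):
    name = "reviews"

    def start_requests(self):
        url = ("https://www.amazon.cn/gp/profile/A34PAP6LGJIN6N/more"
               "?next_batch_params%5Breview_offset%5D=10&_=1469081762384")
        # Mark the request as AJAX, as discovered in the shell session above
        yield scrapy.Request(
            url,
            headers={'X-Requested-With': 'XMLHttpRequest'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Got %s", response.status)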
I'm coding a little snippet to fetch data from a web page, and I'm currently behind an HTTP/HTTPS proxy. The requests are created like this:
headers = {
    'Proxy-Connection': 'Keep-Alive',
    'Connection': None,
    'User-Agent': 'curl/1.2.3',
}
r = requests.get("https://www.google.es", headers=headers, proxies=proxyDict)
At first, neither HTTP nor HTTPS worked, and the proxy returned 403 after the request. It was also weird that I could make HTTP/HTTPS requests with curl, fetch packages with apt-get, and browse the web. Having a look at Wireshark, I noticed some differences between the curl request and the Requests one. After setting User-Agent to a fake curl version, the proxy instantly let me make HTTP requests, so I supposed the proxy filters requests by User-Agent.
So now I know why my code fails, and I can make HTTP requests, but the code keeps failing with HTTPS. I set the headers the same way as with HTTP, but looking at Wireshark, no headers are sent in the CONNECT message, so the proxy sees no User-Agent and returns an ACCESS DENIED response.
I think that if only I could send the headers with the CONNECT message, I could make HTTPS requests easily, but I'm breaking my head over how to tell Requests that I want to send those headers.
OK, so I found a way after looking at http.client. It's a bit lower level than using Requests, but at least it works.
import http.client

def HTTPSProxyRequest(method, host, url, proxy, header=None, proxy_headers=None, port=443):
    # Connect to the proxy and open a tunnel to the target host;
    # proxy_headers are sent along with the CONNECT message.
    https = http.client.HTTPSConnection(proxy[0], proxy[1])
    https.set_tunnel(host, port, headers=proxy_headers)
    https.connect()
    https.request(method, url, headers=header or {})
    response = https.getresponse()
    return response.read(), response.status

# calling the function
HTTPSProxyRequest('GET', 'google.com', '/index.html', ('myproxy.com', 8080))
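To actually send the fake curl User-Agent inside the CONNECT message, pass it via proxy_headers, e.g.:

# The proxy sees this User-Agent in the CONNECT request itself
body, status = HTTPSProxyRequest(
    'GET', 'www.google.es', '/', ('myproxy.com', 8080),
    proxy_headers={'User-Agent': 'curl/1.2.3'},
)
print(status)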