How to scrape data generated with infinite scroll? - python

How can I scrape the list of products from this page with Scrapy?
I have tried the ajax request url the browser sends:
https://www.amazon.cn/gp/profile/A34PAP6LGJIN6N/more?next_batch_params%5Breview_offset%5D=10&_=1469081762384
but it returns 404.

You need to replicate the headers you see in the request.
If you inspect the request in your browser's developer tools, you can see the headers the browser sends along with it.
From these you need to update your scrapy.Request.headers attribute with a few of the values. For the most part you can skip the Cookie header, since Scrapy manages cookies by itself and for ajax requests like this it is usually meaningless.
In this case I've managed to get a successful response by replicating only the X-Requested-With header. This header indicates that the request is an ajax request.
You can test and refine this interactively in the Scrapy shell:
scrapy shell <url>
# gives you 403
request.headers.update({'X-Requested-With': 'XMLHttpRequest'})
request.headers.update({'User-Agent': <some user agent>})
fetch(request)
# now the request is redownloaded and it's 200!
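A minimal sketch of the same idea outside the shell: start from the headers copied from the browser's network tab and keep only the few that matter. The helper name pick_headers, the sample header values, and the default choice of headers to keep are assumptions for illustration; none of them are part of Scrapy.

```python
def pick_headers(browser_headers, keep=("X-Requested-With", "User-Agent")):
    """Return only the headers named in `keep`, matching names case-insensitively."""
    wanted = {name.lower() for name in keep}
    return {
        name: value
        for name, value in browser_headers.items()
        if name.lower() in wanted
    }


# Hypothetical headers, as they might be copied from the browser's developer tools.
copied = {
    "Accept": "text/html, */*; q=0.01",
    "Cookie": "session-id=000-0000000-0000000",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "X-Requested-With": "XMLHttpRequest",
}

# The Cookie and Accept headers are dropped; only the two kept names remain.
ajax_headers = pick_headers(copied)
print(ajax_headers)
```

In a spider, the resulting dict can be passed as scrapy.Request(url, headers=ajax_headers).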

Related

Same identical response, rejected in Scrapy, accepted in Requests

I'm trying to scrape a web site. Before writing the code, I copied the request from the browser's network tool as a curl command and pasted it into Insomnia (like Postman). I could resend the same request and get a 200 response in Insomnia.
However, when I mimic the same request in a Scrapy Request with the same body, URL and headers, Scrapy receives 403 responses. On the other hand, if I generate Python code for the requests library through Insomnia, the identical request works in the requests library.
So how can two identical requests sent with Scrapy and Requests get different responses? That is, why does the one Scrapy sends fail whereas the one Requests sends succeeds?
Thank you

Post request to Amazon fails because of supposedly "invalid email"

I am trying to use Python requests to log into amazon.se. To do so, I first make a GET request to one of the pages, get redirected to the sign-in page, and make a POST request using the data from the login form + my credentials.
The problem is that in response I get the sign-in page with the following error:
I am of course sure that both the email and password are valid, but the login process still fails in both Python and Postman. I tried to compare the browser requests to my manufactured ones, and they seem almost identical except for me missing a couple of what I believe are non-essential headers. Nevertheless, there must be something going on behind the scenes that I am missing.
Postman POST request
Headers:
Body (form data from previous GET request):
Browser POST request
Headers:
Body:

How should i get another redirected page url in python?

When you open some URLs in a normal browser, they redirect to another website's URL. A shortened link is an example: after you open it, it redirects you to the main URL.
How can I do this in Python? I need to open a URL in Python, let it redirect to the other website's page, and then get that page's link.
That's all I want to know, thank you.
I tried it with the requests and urllib modules.
Like this
import requests
a = requests.get("url", allow_redirects = True)
And
import urllib.request
a = urllib.request.urlopen("url")
But it's not working at all. I mean I don't get the redirected page.
I know 4 types of redirection.
1. The server sends a response with a 3xx status and the new address
   HTTP/1.1 302 Found
   Location: https://new_domain.com/some/folder
   Wikipedia: HTTP 301, HTTP 302, HTTP 303
2. The server sends a Refresh header with a time in seconds and the new address
   Refresh: 0; url=https://new_domain.com/some/folder
3. The server sends HTML with a meta tag which emulates the Refresh header
   <meta http-equiv="refresh" content="0; url=https://new_domain.com/some/folder">
   Wikipedia: meta refresh
4. JavaScript sets a new location
   location = url
   location.href = url
   location.replace(url)
   location.assign(url)
   The same applies to document.location and window.location.
   There may also be combinations with open(), document.open(), window.open().
requests automatically follows the first type. It does not act on the Refresh header, so for the second type you have to read the header yourself and request the new URL. urllib.request.urlopen() also follows the common 3xx redirects by itself. Some pages chain several redirections, so you may need to repeat this in a loop. You can test it on httpbin.org (even for multi-redirections).
For the third type it is easy to check whether the HTML has a meta refresh tag and run the next request with the new URL. Again, you can run this in a loop because some pages have many redirections.
The fourth type is a problem because requests can't run JavaScript and there are many different ways to assign a new location. Sites can also hide it in the code ("obfuscation").
In requests you can check response.history to see executed redirections
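The second and third types can be handled with a little parsing, sketched below using only the standard library. The names parse_refresh, MetaRefreshFinder and next_url_from_html are made up for this illustration, and the regular expression is a simplification that assumes the usual "seconds; url=..." layout.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


def parse_refresh(value):
    """Extract the URL from a Refresh value such as '0; url=https://example.com/'."""
    match = re.search(r"url\s*=\s*['\"]?([^'\"\s]+)", value, flags=re.IGNORECASE)
    return match.group(1) if match else None


class MetaRefreshFinder(HTMLParser):
    """Remember the content of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_refresh = (attrs.get("http-equiv") or "").lower() == "refresh"
        if tag == "meta" and self.refresh is None and is_refresh:
            self.refresh = attrs.get("content") or ""


def next_url_from_html(base_url, html):
    """Return the absolute meta-refresh target found in `html`, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.refresh is None:
        return None
    target = parse_refresh(finder.refresh)
    # The target may be relative, so resolve it against the page's own URL.
    return urljoin(base_url, target) if target else None


page = '<meta http-equiv="refresh" content="0; url=https://new_domain.com/some/folder">'
print(parse_refresh("0; url=https://new_domain.com/some/folder"))
print(next_url_from_html("https://old_domain.com/", page))
```

Calling next_url_from_html() on each downloaded page until it returns None gives the loop described above. The fourth (JavaScript) type is not covered by this sketch.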

Requests not loading the content as web Browser gives Python

Hey! I am new here, so let me describe my issue clearly. Please ignore any mistakes.
I am making a request to a page which literally runs on JS.
Actually it is the Paytm payment response page for UPI.
Whenever I make the request, the response is {'POLL_STATUS':"STOP_POLLING"}
The problem is that my request gets this response, while the browser gets a different response with loaded HTML.
I tried everything, like stopping redirects and printing the raw content, but nothing works.
I think a urllib POST request might work, but I do not know how to use it.
Can anyone please tell me how to get the exact HTML response that the browser gets?
Note[0]: Please don't suggest Selenium, because this issue occurs in the middle of my script.
Note[1]: A friendly answer is appreciated.
for i in range(0,15):
    resp_check_transaction=self.s.post("https://secure.website.in/theia/upi/transactionStatus?MID="+str(Merchant_ID)+"&ORDER_ID="+str(ORDER_ID),headers=check_transaction(str(ORDER_ID)),data=check_transaction_payload(Merchant_ID,ORDER_ID,TRANSID,CASHIERID))
    print(resp_check_transaction.text)
    resp_check_transaction=resp_check_transaction.json()
    if resp_check_transaction['POLL_STATUS']=="STOP_POLLING":
        print("Breaking looop")
        break
    time.sleep(4)
self.clear_header()
parrms={
    "MID": str(Merchant_ID),
    "ORDER_ID": str(ORDER_ID)
}
resp_transaction_pass=requests.post("https://secure.website.in/theia/upi/transactionStatus",headers=transaction_pass(str(ORDER_ID)),data=transaction_pass_payload(CASHIERID,UPISTATUSURL,Merchant_ID,ORDER_ID,TRANSID,TXN_AMOUNT),params=parrms,allow_redirects=True)
print("Printing response")
print(resp_transaction_pass.text)
print(resp_transaction_pass.content)
And in the web browser it shows Status Code: 302 Moved Temporarily for the bank response. :(
About the 302 status code
You mention that the web browser shows a 302 status code in response to the request. In the simplest terms, the 302 status code is just the web server's way of saying "Hey, I know what you're looking for, but it is actually located at this other URL."
Basically all modern browsers and HTTP request libraries like Python's Requests will automatically follow a 302 redirect and act as though you sent the request to the new URL instead. (Your browser's developer tools may show that a 302 redirect has happened, but as far as the JavaScript is concerned it just got a normal 200 response.)
If you really want to see whether your Python script receives a 302 status, you can do so by setting the allow_redirects option to False, but this means you will have to fetch the content from the new URL manually.
import requests
r1 = requests.get('https://httpstat.us/302', allow_redirects=False)
r2 = requests.get('https://httpstat.us/302', allow_redirects=True)
print('No redirects:', r1.status_code) # 302
print('Redirects on:', r2.status_code) # 200 (status code of page it redirects to)
Note that allow_redirects is already set to True by default, I just wanted to make the example a bit more verbose so the difference is obvious.
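If you do turn redirects off, fetching the new URL yourself mostly comes down to reading the Location header and resolving it against the URL you requested. A minimal sketch, assuming a plain 3xx redirect; the function name redirect_target and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin

# Status codes that carry a Location header pointing at the new URL.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status_code, headers):
    """Return the absolute URL a 3xx response points to, or None if it is not a redirect."""
    location = headers.get("Location")
    if status_code not in REDIRECT_CODES or not location:
        return None
    # Location may be relative, so resolve it against the URL that was requested.
    return urljoin(request_url, location)


print(redirect_target("https://example.com/pay/status", 302, {"Location": "/pay/result"}))
print(redirect_target("https://example.com/pay/status", 200, {}))
```

With Requests, the three arguments would come from r1.url, r1.status_code and r1.headers of the response received with allow_redirects=False.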
So why is the response content different?
Even though the browser and the Requests library both follow the 302 redirect automatically, the responses they get are still different. You didn't share any screenshots of the browser's requests or responses, so I can only give a few educated guesses, but it boils down to the fact that the request made by your Python code is somehow different from the one made by the JavaScript running in the web browser.
Some things to consider:
Are you sure you are using the correct HTTP method? Is the browser also making a POST request?
If so, are you sure the body of the request is the same as, and in the same format as, the one sent by the web browser?
Perhaps the browser has a session cookie it is sending along with the request (note that this is usually not stated explicitly in the JS but happens automatically).
Alternatively, the JS might include some API key/credentials in the HTTP auth header (this should be explicitly visible in the JS).
Although unlikely, it could be that whatever API you're trying to query is trying to block reverse engineering attempts by blocking the Requests library's user agent string.
Luckily all of these differences can be easily examined with some print statements and your browser's developer tools :p.
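One way to do that comparison is to copy the request headers from the developer tools into a dict and diff them against what the script sends. A rough sketch; the helper name diff_headers and all header values below are made up for illustration.

```python
def diff_headers(browser, script):
    """Compare two header dicts, treating header names case-insensitively."""
    b = {name.lower(): value for name, value in browser.items()}
    s = {name.lower(): value for name, value in script.items()}
    return {
        "missing": sorted(b.keys() - s.keys()),   # sent by the browser only
        "extra": sorted(s.keys() - b.keys()),     # sent by the script only
        "different": sorted(k for k in b.keys() & s.keys() if b[k] != s[k]),
    }


# Hypothetical values: one set copied from the browser, one sent by the script.
browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "JSESSIONID=abc",
    "Content-Type": "application/x-www-form-urlencoded",
}
script_headers = {
    "user-agent": "python-requests/2.31.0",
    "Content-Type": "application/x-www-form-urlencoded",
    "Accept": "*/*",
}

print(diff_headers(browser_headers, script_headers))
```

With Requests, the headers that were actually sent are available afterwards as response.request.headers, which can be passed as the second argument.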

Use Python to make HTTP Post Request for form

I am trying to make an HTTP Post request using Python. The specific form I want to submit is on the following page: http://143.137.111.105/Enlace/Resultados2010/Basica2010/R10Folio.aspx
Using Chrome Dev Tools it seems like pushing the button makes an HTTP Post request but I am trying to figure out the exact request that is made. I currently have the following in Python:
import requests
url = 'http://143.137.111.105/Enlace/Resultados2010/Basica2010/R10Folio.aspx'
values = {
'txtFolioAlumno': '210227489P10',
}
r = requests.post(url, data=values)
print(r.content)
However, when I run this it simply prints out the HTML of the old page instead of returning the data from the new page (I am interested in getting the number next to 'Matematicas', 422 in this case). I have achieved this task using Selenium which actually opens a test browser, but I want to query the server directly.
