I'm trying to scrape some data from an online GIS system that uses XML. I was able to whip up a quick script using the requests library that successfully posted a payload and received an HTTP 200 with the correct results, but when I moved the request over to Scrapy, I continually got a 413. I inspected the two requests using Wireshark and found a few differences, though I'm not totally sure I understand them.
The request in scrapy looks like:
yield Request(
    self.parcel_number_url,
    headers={'Accept': '*/*',
             'Accept-Encoding': 'gzip,deflate,sdch',
             'Accept-Language': 'en-US,en;q=0.8',
             'Connection': 'keep-alive',
             'Content-Length': '823',
             'Content-Type': 'application/xml',
             'Host': 'xxxxxxxxxxxx',
             'Origin': 'xxxxxxxxxxx',
             'Referer': 'xxxxxxxxxxxx',
             'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36',
             'X-Requested-With': 'XMLHttpRequest'},
    method='POST',
    cookies={'_ga': 'GA1.3.1332485584.1402003562', 'PHPSESSID': 'tpfn5s4k3nagnq29hqrolm2v02'},
    body=PAYLOAD,
    callback=self.parse
)
The packets I inspected are located here: http://justpaste.it/fxht
That includes the HTTP request when using the requests library and the HTTP request when yielding a Scrapy Request object. The request is larger when using Scrapy: the 2nd TCP segment is 21 bytes larger than the corresponding segment from the requests library, and the Content-Length header gets set twice in the Scrapy request.
Has anyone ever experienced this kind of problem with scrapy? I've never gotten a 413 scraping anything before.
I resolved this by removing the cookies and not setting the "Content-Length" header manually on my yielded request. It seems those two things accounted for the extra 21 bytes on the 2nd TCP segment and caused the 413 response. Perhaps the server was interpreting "Content-Length" as the combined value of the two "Content-Length" headers and therefore returning a 413, but I'm not certain.
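For reference, a minimal sketch of the request with those two changes applied (trimmed to the essential headers; all names come from the snippet above, and Scrapy computes Content-Length from the body on its own):

yield Request(
    self.parcel_number_url,
    method='POST',
    headers={'Accept': '*/*',
             'Content-Type': 'application/xml',
             'X-Requested-With': 'XMLHttpRequest'},
    # No explicit Content-Length and no manual cookies: Scrapy derives
    # the length from the body, and a duplicate header can make servers
    # reject the request with a 413.
    body=PAYLOAD,
    callback=self.parse
)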
So I'm making a function that returns whether I can watch a certain anime on, in this case, Crunchyroll, but after looking for the right solution for a couple of days I could not quite find the answer. I am very new to web scraping, so I don't have any experience.
Here are the headers I've added. The only one that makes a difference right now is the Host header.
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,nl;q=0.8,ja;q=0.7',
    'connection': 'keep-alive',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'sec-fetch-dest': 'document',
    'sec-ch-ua-mobile': '?0',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'host': 'crunchyroll.com',
    'referer': 'https://www.google.com/',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
}
And here is the last version I tried. I have already tried to work with the requests module, but it keeps giving me the "exceeded 20 redirects" error. After using allow_redirects=False it gives me the 301 error, but if there is a solution through the requests module I'd be happy too.
(namelist is, for example: [rezero-kara-hajimeru-isekai-seikatsu, rezero-starting-life-in-another-world-, rezero])
import urllib.request
from http.cookiejar import CookieJar

for i in namelist:
    # Crunchyroll Checker
    Cleanlink = 'https://www.crunchyroll.com/en-gb/'
    attempt = Cleanlink + i
    try:
        req = urllib.request.Request(attempt, headers=header)
        cj = CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(cj),
            urllib.request.HTTPRedirectHandler,
        )
        response = opener.open(req)
        response.close()
        print(response)
    except urllib.request.HTTPError as inst:
        output = format(inst)
        print(output)
This code gives me this response:
The last 30x error message was:
Moved Permanently
The last 30x error message was:
Moved Permanently
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
So the only thing I need is for the code to be able to check whether a page exists. For example: https://www.crunchyroll.com/en-gb/rezero-starting-life-in-another-world- should return the 200 code.
Thanks in advance.
Crunchyroll has a very strict Cloudflare WAF. If there is anything fishy about your requests, it will give you a hard time right away. Possible reasons why you get a 301 are:
- Maybe you are requesting from behind a proxy.
- The WAF may check whether the requester (urllib or the requests module, in your case) has JavaScript enabled, so it can tell whether the requester is a bot or a real user.
Solution
You should use this Python lib to do the request for you. It is a wrapper around the requests lib, so you can use it as if you were using the requests module directly.
https://pypi.org/project/cloudscraper/
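A minimal sketch of using it (the URL is one of the examples from the question; create_scraper() returns a requests-style session object):

import cloudscraper

# create_scraper() returns a requests.Session-like object that solves
# Cloudflare's JavaScript challenge before sending the real request.
scraper = cloudscraper.create_scraper()
resp = scraper.get('https://www.crunchyroll.com/en-gb/rezero-starting-life-in-another-world-')
print(resp.status_code)  # 200 if the page exists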
Note
Even with this lib, you can only send a few hundred requests every few hours, because the WAF will still detect that your IP is requesting too much and block you. Crunchyroll's WAF is nasty.
I am trying to make a program that checks for ski lift reservation openings. So far I am able to get the correct response from the API, but it only works for about 15 minutes before some cookie expires. Here is my current process.
Go to the site https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx, look at the network traffic, and copy the highlighted XHR request as cURL (bash):
[screenshot: the website/API in question]
I then take that cURL (bash) and import it into Postman and get the response:
[screenshot: Postman response]
Then I take the code from Postman so I can run it in Python. The code generated by Postman:
import requests, json

url = ("https://www.keystoneresort.com/api/LiftAccessApi/GetLiftTicketControlReservationInventory?"
       "startDate=01%2F21%2F2021&endDate=03%2F06%2F2021&_=1611254694375")

payload = {}
headers = {
    'authority': 'www.keystoneresort.com',
    'accept': 'application/json, text/javascript, */*; q=0.01',
    'x-queueit-ajaxpageurl': 'https%3A%2F%2Fwww.keystoneresort.com%2Fplan-your-trip%2Flift-access%2Ftickets.aspx%3FstartDate%3D01%252F23%252F2021%26numberOfDays%3D1%26ageGroup%3DAdult',
    'x-requested-with': 'XMLHttpRequest',
    '__requestverificationtoken': 'mbVIzNL1qZUKDT3Re8H9kXVNoYLmQPC-tgLCSbM_inVSN1v_2Pei-A-GWDaKL7i6NRIVTr0lnlmiYACNvfmd6Zzsikk1:HI8y8wZJXMuP7nsTJwS-adYZu7FoHVPVHWY5naHRiB71dg2PzehuQa8WJy418eIrVqwmvhw-a1F34sJ425mXzWpEANE1',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36',
    'save-data': 'off',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx?startDate=01%2F23%2F2021&numberOfDays=1&ageGroup=Adult',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'QueueITAccepted-SDFrts345E-V3_vailresortsecomm1=EventId%3Dvailresortsecomm1%26QueueId%3D96d15411-09e1-4443-89a3-f0d6e4cef5d5%26RedirectType%3Dsafetynet%26IssueTime%3D1611254692%26Hash%3D06e1aecd2d5cdf64363d53f4fc63f1c22316f604895cd3ecfd1d8b03f86ba36a; TS019b45a2=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; TS01f060ff=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; AMCV_974C370453295F9A0A490D44%40AdobeOrg=1406116232%7CMCIDTS%7C18649%7CMCMID%7C30886069937558409272202898840476568322%7CMCAAMLH-1611859494%7C9%7CMCAAMB-1611859494%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1611261894s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C2.5.0;'
}

s = requests.Session()
y = s.get(url)
print(y)

response = requests.request("GET", url, headers=headers, data=payload)
todos = json.loads(response.text)
x = json.dumps(todos, indent=2)
print(x)
Now if you run this in Python, it will not work, because the cookies will have expired for this session by the time someone tries it; you would have to follow the process I listed above to see what I am doing. The response I get looks like this, which is what I want, except that it expires:
[screenshot: Python response]
I have looked extensively at different ways to get the cookies using requests and Selenium. All the solutions I have tried only get some of the cookies, not all of them. I need the ones in the "cookie" header listed in my code, but I have not found a way to get those without refreshing the page, posting the cURL into Postman, and copying the response. I am still fairly new to Python and coding in general, so don't go too hard on me if the answer is super simple.
I think some of these cookies are set by JavaScript, which may be part of the problem. I can also delete some of the cookies in my code and have it still work (until it expires). If there is an easier way to do what I am doing, please let me know.
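One commonly suggested way to collect every cookie, including the JavaScript-set ones, is to load the page in a real browser via Selenium and copy its cookie jar into a requests session. A minimal sketch, assuming chromedriver is available on PATH and using the tickets page URL from above:

from selenium import webdriver
import requests

# Load the page in a real browser so its JavaScript (queue/WAF scripts)
# can set all the cookies, then copy them into a requests session.
driver = webdriver.Chrome()
driver.get('https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx')

session = requests.Session()
for c in driver.get_cookies():
    session.cookies.set(c['name'], c['value'], domain=c['domain'])
driver.quit()

# session now carries the browser's cookies for subsequent API calls.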
Thanks.
I am trying to crawl a website and copied the Request Headers information from Chrome directly; however, after using requests.get, the returned content is empty, even though the headers I printed from requests are correct. Does anyone know the reason for this? Thx!
Mac, Chrome, Python3.7
[screenshots: General information / Requests information]
import requests

headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
    'Cookie': '_RSG=Ja4TD8hvFh2MGc7wBysunA; _RDG=28458f5367f9b123363c043b75e3f9aa31; _RGUID=2acfe6b2-0d74-4913-ac78-dbc2fa1e6416; _abtest_userid=bce0b01e-fdb6-48c8-9b86-4e1d8ef468df; _ga=GA1.2.937100695.1547968515; Session=SmartLinkCode=U155952&SmartLinkKeyWord=&SmartLinkQuery=&SmartLinkHost=&SmartLinkLanguage=zh; HotelCityID=5split%E5%93%88%E5%B0%94%E6%BB%A8splitHarbinsplit2019-01-25split2019-01-26split0; Mkt_UnionRecord=%5B%7B%22aid%22%3A%224897%22%2C%22timestamp%22%3A1548157938143%7D%5D; ASP.NET_SessionId=w1pq5dvchogxhbnxzmbgbtkk; OID_ForOnlineHotel=1509697509766jepc81550141458933102003; _RF1=123.165.147.203; MKT_Pagesource=PC; HotelDomesticVisitedHotels1=698432=0,0,4.5,3674,/hotel/8000/7899/df84daa197dd4b868868cba4db14f71f.jpg,&448367=0,0,4.3,4455,/fd/hotel/g6/M02/6D/8B/CggYtFc1nAKAEnRYAAdgA-rkEXw300.jpg,&13679014=0,0,4.9,1484,/200g0w000000k4wqrB407.jpg,; __zpspc=9.6.1550232718.1550232718.1%234%7C%7C%7C%7C%7C%23; _jzqco=%7C%7C%7C%7C1550232718632%7C1.2024536341.1547968514847.1550141461869.1550232718448.1550141461869.1550232718448.undefined.0.0.13.13; _gid=GA1.2.506035914.1550232719; _bfi=p1%3D102003%26p2%3D102003%26v1%3D18%26v2%3D17; appFloatCnt=8; _bfa=1.1509697509766.jepc8.1.1550141458610.1550232715314.7.19; _bfs=1.2',
    'Host': 'hotels.ctrip.com',
    'Referer': 'http://hotels.ctrip.com/hotel/698432.html?isFull=F',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'
}
url = 'http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815'
data = requests.get(url, headers=headers)
print(data.request.headers)
The request header information that you shared in the image shows that the server responded correctly to the request. Also, the actual URL that you shared, http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815,
is different from the one shown in the image. In fact, it seems the actual page calls a lot of other URLs to build the final page, so there is no guarantee that you will get the response you see in the browser when you use requests. If the implementation at the server end depends on the browser's JavaScript engine to execute scripts and render the content, you won't be able to get the final HTML as it looks in the browser. It would be better to use Selenium WebDriver in those cases to hit the URL and then get the HTML content. Again, if you can share the actual URL, I can suggest other ideas.
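A minimal sketch of that Selenium fallback, assuming chromedriver is installed and using the hotel page URL from the question's Referer header:

from selenium import webdriver

# Let a real browser execute the page's JavaScript, then read the
# fully rendered HTML.
driver = webdriver.Chrome()
driver.get('http://hotels.ctrip.com/hotel/698432.html?isFull=F')
html = driver.page_source
driver.quit()
print(html[:500])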
I am dealing with this little error and I cannot find the solution. I authenticated into a page with the Chrome "Inspect/Network" tool open to see what web service is called and how. I found out this is used:
I have censored sensitive data related to the site. So, I have to make this same request using Python, but I am always getting error 500, and the log on the server side is not showing helpful information (only a Java traceback).
This is the code of the request:
response = requests.post(url, data='username=XXXXX&password=XXXXXXX')
url has the same string that you see in the image under the "General/Request URL" label.
data has the same string that you see in the image under "Form Data".
It looks like a very simple request, but I cannot get it to work :(.
Best regards
If you want your request to appear as if it's coming from Chrome, then besides sending the correct data you need to specify headers as well. The reason you got a 500 error is probably that there are settings on the server side disallowing traffic from "non-browsers".
So in your case, you need to add headers:
headers = {'Accept': 'application/json, text/plain, */*',
           'Accept-Encoding': 'gzip, deflate',
           ......  # more
           'User-Agent': 'Mozilla/5.0 XXXXX...'  # this line tells the server what browser/agent is used for this request
           }
response = requests.post(url, data='username=XXXXX&password=XXXXXXX', headers=headers)
P.S. If you are curious, default headers from requests are:
>>> import requests
>>> session = requests.Session()
>>> session.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*', 'User-Agent': 'python-requests/2.13.0'}
As you can see the default User-Agent is python-requests/2.13.0, and some websites do block such traffic.
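For example, to keep the convenience of a session while no longer advertising python-requests, you can override the defaults once (a sketch; the UA string here is just an example borrowed from elsewhere in this thread):

session = requests.Session()
# Every request made through this session now carries a browser-like UA.
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'})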
I'm looking to write a script that can automatically download .zip files from the Bureau of Transportation Statistics Carrier website, but I'm having trouble getting the same response headers as I see in Chrome when I download the zip file there. I'm looking to get a response header that looks like this:
HTTP/1.1 302 Object moved
Cache-Control: private
Content-Length: 183
Content-Type: text/html
Location: http://tsdata.bts.gov/103627300_T_T100_SEGMENT_ALL_CARRIER.zip
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 21 Apr 2016 15:56:31 GMT
However, when calling requests.post(url, data=params, headers=headers) with the same information that I can see in the Chrome network inspector I am getting the following response:
>>> res.headers
{'Cache-Control': 'private', 'Content-Length': '262', 'Content-Type': 'text/html', 'X-Powered-By': 'ASP.NET', 'Date': 'Thu, 21 Apr 2016 20:16:26 GMT', 'Server': 'Microsoft-IIS/8.5'}
It's got pretty much everything, except it's missing the Location key that I need in order to download the .zip file with all of the data I want. The Content-Length value is also different, but I'm not sure whether that's an issue.
I think my issue has something to do with the fact that when you click "Download" on the page, it actually sends two requests, which I can see in the Chrome network console. The first is a POST request that yields an HTTP 302 response carrying the Location in its response header. The second is a GET request to the URL specified in that Location value.
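In requests terms, that two-step flow would look something like this (a sketch; url, params, and headers are the same objects as in the requests.post call above — note that requests follows redirects automatically by default, so this is only needed to see the Location value explicitly):

# Step 1: the POST that answers 302 with the zip's address.
res = requests.post(url, data=params, headers=headers, allow_redirects=False)
if res.status_code == 302:
    zip_url = res.headers['Location']
    # Step 2: fetch the zip itself.
    zip_bytes = requests.get(zip_url).content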
Should I really be sending two requests here? Why am I not getting the same response headers using requests as I do in the browser? FWIW I used curl -X POST -d /*my data*/ and got back this in my terminal:
<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found here.</body>
Really appreciate any help!
I was able to download the zip file that I was looking for by using almost all of the headers that I could see in the Google Chrome web console. My headers looked like this:
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Referer': 'http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=293',
    'Origin': 'http://www.transtats.bts.gov',
    'Upgrade-Insecure-Requests': '1',  # header values must be strings, not ints
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36',
    'Cookie': 'ASPSESSIONIDQADBBRTA=CMKGLHMDDJIECMNGLMDPOKHC',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Type': 'application/x-www-form-urlencoded'
}
And then I just wrote:
res = requests.post(url, data=form_data, headers=headers)
where form_data was copied from the "Form Data" section of the Chrome console. Once I got that response, I used the zipfile and io modules to extract the content of the response stored in res, like this:
import zipfile, io
# Wrap the response bytes in a file-like object and extract the archive.
zipfile.ZipFile(io.BytesIO(res.content)).extractall()
and then the files were in the directory where I ran the Python code.
Thanks to the users who answered on this thread.