Python XHR Request Timing Out - python

I'm trying to wrap my head around using requests to get JavaScript-loaded content without spawning an actual browser to render it. I'm looking at using the requests lib to get the tables, but I keep getting a 504 with my test code and I'm not 100% sure why.
So I'm looking at getting horse racing data from: sports.betway.com/#/horse-racing/uk-and-ireland/haydock
I watched the network traffic and found the source of the traffic. It's a call to /emoapi/emos with an eventIds number.
I tried this:
import requests
url = 'https://sports.betway.com/emoapi/emos'
params = {
'eventIds': '807789',
'lang': 'en'
}
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive',
'Content-Length': '271',
'Content-Type': 'application/json',
'Host': 'sports.betway.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36'}
#Note: I do also set the origin and ref link in the header but I can't post that many links in a question.
response = requests.post(url, params=params, headers=headers)
print response
fixtures = response.json()
print fixtures
I can't see what else I'm missing from the request, but the print response comes back as a 504.
This is an example of the full payload on the browser header which requests a whole bunch of Ids rather than just the one I'm trying:
{"eventIds":[807789,808612,808597,807790,808613,808598,807791,808611,808599,807792,808614,808600,807793,808615,808601,807794,808616,808602,807795,808617,807781,808591,807782,808589,807783,808590,807785,808592,807784,808593,807786,808594,807788,808595,807787],"lang":"en"}
And it's a POST to that URL so I'm not sure why it's timing out.
Can anyone shed any light on where I'm going wrong here? Is it something painfully obvious?

The payload should be sent in the request body rather than in the URL params.
The payload in this case is a raw JSON string.
import requests
url = 'https://sports.betway.com/emoapi/emos'
data = '{"eventIds": [807789]}'
response = requests.post(url, data=data)
print(response.text)
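To make the difference between URL params and a request body concrete, here is a minimal sketch that only builds the two requests and never sends them. It uses the standard library's urllib instead of requests so it runs without a network connection; the helper names build_query_request and build_json_request are made up for illustration, and whether the server also needs the lang field or extra headers is an assumption to check against the browser traffic.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://sports.betway.com/emoapi/emos"


def build_query_request(event_id):
    """Roughly what params= does: values go into the URL, the body stays empty."""
    query = urlencode({"eventIds": event_id, "lang": "en"})
    return Request(URL + "?" + query, method="POST")


def build_json_request(event_ids):
    """What the browser does: values travel as a JSON document in the body."""
    body = json.dumps({"eventIds": event_ids, "lang": "en"}).encode("utf-8")
    return Request(
        URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


query_req = build_query_request(807789)
json_req = build_json_request([807789])

# The first request has no body at all; the second carries the JSON payload.
print(query_req.full_url, query_req.data)
print(json_req.full_url, json_req.data)
```

With requests itself, passing the dictionary through the json= keyword should produce the same kind of body as the second request.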

Related

Not getting all cookies from Javascript site (python)

I am trying to make a program that checks for ski lift reservation openings. So far I am able to get the correct response from the API, but it only works for about 15 minutes before some cookie expires. Here is my current process.
Go to the site https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx, look at the network traffic, and copy the XHR request to GetLiftTicketControlReservationInventory as cURL (bash).
I then import that cURL command into Postman and get the response.
Then I take the Python code that Postman generates so I can run it myself:
import requests, json
url = "https://www.keystoneresort.com/api/LiftAccessApi/GetLiftTicketControlReservationInventory?startDate=01%2F21%2F2021&endDate=03%2F06%2F2021&_=1611254694375"
payload={}
headers = {
'authority': 'www.keystoneresort.com',
'accept': 'application/json, text/javascript, */*; q=0.01',
'x-queueit-ajaxpageurl': 'https%3A%2F%2Fwww.keystoneresort.com%2Fplan-your-trip%2Flift-access%2Ftickets.aspx%3FstartDate%3D01%252F23%252F2021%26numberOfDays%3D1%26ageGroup%3DAdult',
'x-requested-with': 'XMLHttpRequest',
'__requestverificationtoken': 'mbVIzNL1qZUKDT3Re8H9kXVNoYLmQPC-tgLCSbM_inVSN1v_2Pei-A- GWDaKL7i6NRIVTr0lnlmiYACNvfmd6Zzsikk1:HI8y8wZJXMuP7nsTJwS-adYZu7FoHVPVHWY5naHRiB71dg2PzehuQa8WJy418eIrVqwmvhw-a1F34sJ425mXzWpEANE1',
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36',
'save-data': 'off',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.keystoneresort.com/plan-your-trip/lift-access/tickets.aspx?startDate=01%2F23%2F2021&numberOfDays=1&ageGroup=Adult',
'accept-language': 'en-US,en;q=0.9',
'cookie': 'QueueITAccepted-SDFrts345E-V3_vailresortsecomm1=EventId%3Dvailresortsecomm1%26QueueId%3D96d15411-09e1-4443-89a3-f0d6e4cef5d5%26RedirectType%3Dsafetynet%26IssueTime%3D1611254692%26Hash%3D06e1aecd2d5cdf64363d53f4fc63f1c22316f604895cd3ecfd1d8b03f86ba36a; TS019b45a2=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; TS01f060ff=01d73c084b0f6abf04d77ffeb9e37953f3d047ebae13a4f5ffa8e69045bf156b4959e093cf10f08359c6f45a491fdc474e068898a9; AMCV_974C370453295F9A0A490D44%40AdobeOrg=1406116232%7CMCIDTS%7C18649%7CMCMID%7C30886069937558409272202898840476568322%7CMCAAMLH-1611859494%7C9%7CMCAAMB-1611859494%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1611261894s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C2.5.0;'
}
s = requests.Session()
y = s.get(url)
print(y)
response = requests.request("GET", url, headers=headers, data=payload)
todos = json.loads(response.text)
x = json.dumps(todos, indent = 2)
print(x)
If you run this in Python now, it will not work, because the cookies for this session will have expired by the time anyone tries it. You would have to follow the process listed above to see what I am doing. The response I get is what I want; I just need it not to expire.
I have looked extensively at different ways to get the cookies using requests and selenium. Every solution I have tried gets only some of the cookies, not all of them. I need the ones in the "cookie" header in my code, but I have not found a way to get them without refreshing the page, importing the cURL command into Postman, and copying the generated code. I am still fairly new to Python and coding in general, so don't go too hard on me if the answer is super simple.
I think some of these cookies are set by JavaScript, which may be part of the problem. I can also delete some of the cookies in my code and still have it work (until it expires). If there is an easier way to do what I am doing, please let me know.
Thanks.
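Not a full answer, but since deleting some cookies reportedly still works, here is a small sketch for splitting the "cookie" header into individual cookies so they can be dropped one at a time to find out which ones the API actually needs. The helper names parse_cookie_header and build_cookie_header and the short sample header are made up for illustration; the real header from the question would be passed in instead.

```python
def parse_cookie_header(header):
    """Split a raw Cookie header into an ordered name -> value mapping."""
    cookies = {}
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece left by a trailing ';'
        # Split on the first '=' only, because values may contain '=' themselves.
        name, _, value = part.partition("=")
        cookies[name] = value
    return cookies


def build_cookie_header(cookies, drop=()):
    """Rebuild the header, leaving out any cookie names listed in drop."""
    return "; ".join(
        f"{name}={value}" for name, value in cookies.items() if name not in drop
    )


# Hypothetical sample header, much shorter than the real one.
sample_header = "a=1; b=x%3Dy; c=3;"
cookies = parse_cookie_header(sample_header)
print(cookies)
print(build_cookie_header(cookies, drop=("b",)))
```

Sending the request once per dropped cookie and checking whether the response is still valid should narrow the list down to the cookies that matter.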

Python request to crawl URL returns 404 Error while working inside the browser

I have a Python crawling script that fails on one URL: pulsepoint.com/sellers.json
The bot uses a standard request to get the content, but gets a 404 error back. In the browser it works (there is a 301 redirect, but requests can follow that). My first hunch was that this could be a request header issue, so I copied my browser configuration. The code looks like this:
import logging
import requests
crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
myheaders = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}
r = requests.get(seller_json_url, headers=myheaders)
logging.info(" %d" % r.status_code)
But I am still getting a 404 Error.
My next guess:
Login? Not used here
Cookies? Not that I can see
So how is their server blocking my bot? This is a URL that is supposed to be crawled, by the way; nothing illegal.
Thanks in advance!
You can also do a workaround on the SSL certificate error like below:
from urllib.request import urlopen
import ssl
import json
#this is a workaround on the SSL error
ssl._create_default_https_context = ssl._create_unverified_context
crawled_url="pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
response = urlopen(seller_json_url).read()
# print in dictionary format
print(json.loads(response))
Sample response:
{'contact_email': 'PublisherSupport@pulsepoint.com', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY, 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', ...
...'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impremedia.com', 'seller_type': 'PUBLISHER'}]}
You can request the final URL directly and extract the data; there is no need to go through the 301 redirect to reach it:
import requests
headers = {"Upgrade-Insecure-Requests": "1"}
response = requests.get(
url="https://projects.contextweb.com/sellersjson/sellers.json",
headers=headers,
verify=False,
)
OK, just for other people, here is a hardened version of âńōŋŷXmoůŜ's answer, because:
Some websites require headers before they answer;
Some websites use unusual encodings;
Some websites send a gzipped answer even when it was not requested.
import urllib.request
import ssl
import json
from io import BytesIO
import gzip
ssl._create_default_https_context = ssl._create_unverified_context
crawled_url="pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
req = urllib.request.Request(seller_json_url)
# ADDING THE HEADERS
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0')
req.add_header('Accept','application/json,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
response = urllib.request.urlopen(req)
data=response.read()
# IN CASE THE ANSWER IS GZIPPED
if response.info().get('Content-Encoding') == 'gzip':
    buf = BytesIO(data)
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
# ADAPTS THE ENCODING TO THE ANSWER
print(json.loads(data.decode(response.info().get_param('charset') or 'utf-8')))
Thanks again!
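The gzip and charset handling from the answer above can be pulled into one helper and tried without touching the network. This is only a sketch: decode_json_body is a made-up name, the payload is sample data, and it uses gzip.decompress in place of the BytesIO/GzipFile pair, which should behave the same for a complete response body.

```python
import gzip
import json


def decode_json_body(raw, content_encoding=None, charset=None):
    """Undo optional gzip compression, decode the bytes, then parse the JSON."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    # Fall back to UTF-8 when the response did not declare a charset.
    return json.loads(raw.decode(charset or "utf-8"))


# Sample data standing in for a sellers.json response body.
payload = {"version": "1.0", "sellers": [{"seller_id": "508738"}]}
plain = json.dumps(payload).encode("utf-8")
zipped = gzip.compress(plain)

print(decode_json_body(plain))
print(decode_json_body(zipped, content_encoding="gzip"))
```

In the real script, the two optional arguments would come from response.info().get('Content-Encoding') and response.info().get_param('charset').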

Why does the requests.get with correct header return empty content?

I am trying to crawl a website and copied the request headers directly from Chrome. However, after calling requests.get, the returned content is empty, even though the headers I print from the request are correct. Does anyone know the reason for this? Thanks!
Mac, Chrome, Python3.7
import requests
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
'Cookie': '_RSG=Ja4TD8hvFh2MGc7wBysunA; _RDG=28458f5367f9b123363c043b75e3f9aa31; _RGUID=2acfe6b2-0d74-4913-ac78-dbc2fa1e6416; _abtest_userid=bce0b01e-fdb6-48c8-9b86-4e1d8ef468df; _ga=GA1.2.937100695.1547968515; Session=SmartLinkCode=U155952&SmartLinkKeyWord=&SmartLinkQuery=&SmartLinkHost=&SmartLinkLanguage=zh; HotelCityID=5split%E5%93%88%E5%B0%94%E6%BB%A8splitHarbinsplit2019-01-25split2019-01-26split0; Mkt_UnionRecord=%5B%7B%22aid%22%3A%224897%22%2C%22timestamp%22%3A1548157938143%7D%5D; ASP.NET_SessionId=w1pq5dvchogxhbnxzmbgbtkk; OID_ForOnlineHotel=1509697509766jepc81550141458933102003; _RF1=123.165.147.203; MKT_Pagesource=PC; HotelDomesticVisitedHotels1=698432=0,0,4.5,3674,/hotel/8000/7899/df84daa197dd4b868868cba4db14f71f.jpg,&448367=0,0,4.3,4455,/fd/hotel/g6/M02/6D/8B/CggYtFc1nAKAEnRYAAdgA-rkEXw300.jpg,&13679014=0,0,4.9,1484,/200g0w000000k4wqrB407.jpg,; __zpspc=9.6.1550232718.1550232718.1%234%7C%7C%7C%7C%7C%23; _jzqco=%7C%7C%7C%7C1550232718632%7C1.2024536341.1547968514847.1550141461869.1550232718448.1550141461869.1550232718448.undefined.0.0.13.13; _gid=GA1.2.506035914.1550232719; _bfi=p1%3D102003%26p2%3D102003%26v1%3D18%26v2%3D17; appFloatCnt=8; _bfa=1.1509697509766.jepc8.1.1550141458610.1550232715314.7.19; _bfs=1.2',
'Host': 'hotels.ctrip.com',
'Referer': 'http://hotels.ctrip.com/hotel/698432.html?isFull=F',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'
}
url ='http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815'
data = requests.get(url, headers = headers)
print(data.request.headers)
The request header information in the image you shared shows that the server responded correctly to the request. However, the URL in your code, http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815
is different from the one shown in the image. In fact, the page seems to call a lot of other URLs to build the final page, so there is no guarantee that requests will give you the same response you see in the browser. If the server relies on the browser's JavaScript engine to execute scripts and render the content, you won't get the final HTML as it appears in the browser. In those cases it is better to use Selenium WebDriver to load the URL and then read the HTML content. If you can share the actual URL, I can suggest other ideas.

Python script to download file from button on website

I want to download an xls file by clicking the "Export to Excel" button at the following URL: https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD.
More specifically, the button with name = "ctl00$MainContent$btndata". I've already been able to do this using selenium, but I plan on building a Docker image with this script and running it as a container, because this xls is regularly updated and I need the most current data on my local machine; it doesn't make sense to open a browser that often just to fetch it. I understand there are headless versions of Chrome and Firefox, although I don't believe they support downloads. I also understand that wget will not work in this situation, because the button is not a static link to the resource. Maybe there's a completely different approach for downloading and updating this data on my computer?
import urllib
import requests
from bs4 import BeautifulSoup
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
'Origin': 'https://www.tampagov.net',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
'Content-Type': 'application/x-www-form-urlencoded',
'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
'Accept-Encoding': 'gzip,deflate,br',
'Accept-Language': 'en-US,en;q=0.5',
}
class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f, "html.parser")
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
formData = (
('__EVENTVALIDATION', eventvalidation),
('__VIEWSTATE', viewstate),
('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
('Accept-Encoding', 'gzip, deflate, br'),
('Accept-Language', 'en-US,en;q=0.5'),
('Host', 'apps,tampagov.net'),
('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'))
payload = urllib.urlencode(formData)
# second HTTP request with form data
r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", params=payload)
print(r.status_code, r.reason)
First, I removed import urllib because requests is enough.
Some issues in your code:
You don't need to build a nested tuple and pass it through urllib.urlencode; use a dictionary instead. That is one reason requests is so popular.
You should populate all the parameters of the HTTP POST request, as I did below; otherwise the backend may reject the request.
I added a few lines to save the content to a local file.
PS: you can get the values of those form parameters by analyzing the HTML returned by the HTTP GET. You can also customize the parameters as needed, such as the page size.
Below is a working sample:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
def downloadExcel():
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
        'Origin': 'https://www.tampagov.net',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
        'Accept-Encoding': 'gzip,deflate,br',
        'Accept-Language': 'en-US,en;q=0.5',
    }
    r = requests.get("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", headers=headers)
    # parse and retrieve two vital form values
    if not r.status_code == 200:
        print('Error')
        return
    soup = BeautifulSoup(r.content, "html.parser")
    viewstate = soup.select("#__VIEWSTATE")[0]['value']
    eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
    print ('__VIEWSTATE:', viewstate)
    print ('__EVENTVALIDATION:', eventvalidation)
    formData = {
        '__EVENTVALIDATION': eventvalidation,
        '__VIEWSTATE': viewstate,
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATEGENERATOR': '49DF2C80',
        'MainContent_RadScriptManager1_TSM':""";;System.Web.Extensions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35:en-US:59e0a739-153b-40bd-883f-4e212fc43305:ea597d4b:b25378d2;Telerik.Web.UI, Version=2015.2.826.40, Culture=neutral, PublicKeyToken=121fae78165ba3d4:en-US:c2ba43dc-851e-4009-beab-3032480b6a4b:16e4e7cd:f7645509:24ee1bba:c128760b:874f8ea2:19620875:4877f69a:f46195d3:92fe8ea0:fa31b949:490a9d4e:bd8f85e4:58366029:ed16cbdc:2003d0b8:88144a7a:1e771326:aa288e2d:b092aa46:7c926187:8674cba1:ef347303:2e42e72a:b7778d6c:c08e9f8a:e330518b:c8618e41:e4f8f289:1a73651d:16d8629e:59462f1:a51ee93e""",
        'search_block_form':'',
        'ctl00$MainContent$btndata':'Export to Excel',
        'ctl00_MainContent_RadWindow1_C_RadGridVehicles_ClientState':'',
        'ctl00_MainContent_RadWindow1_ClientState':'',
        'ctl00_MainContent_RadWindowManager1_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl00$PageSizeComboBox':'20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time$dateInput':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_dateInput_ClientState':'{"enabled":true,"emptyMessage":"","validationText":"","valueAsString":"","minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00","lastSetTextBoxValue":""}',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_ClientState':'{"minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00"}',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1address':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1address_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1case_description':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1case_description_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_grid':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1report_number':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1report_number_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_max_date':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_rowcount':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox':'20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState':'',
        'ctl00_MainContent_RadGrid1_rfltMenu_ClientState':'',
        'ctl00_MainContent_RadGrid1_gdtcSharedTimeView_ClientState':'',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_SD':'[]',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_AD':'[[1900,1,1],[2099,12,31],[2018,3,29]]',
        'ctl00_MainContent_RadGrid1_ClientState':'',
    }
    # second HTTP request with form data
    r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", data=formData, headers=headers)
    print('received:', r.status_code, len(r.content))
    with open(r"C:\Users\xxx\Desktop\test\test\apps.xls", "wb") as handle:
        for data in tqdm(r.iter_content()):
            handle.write(data)
downloadExcel()
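As the PS above notes, the form parameters can be read from the HTML returned by the GET request instead of being typed in by hand. Here is a minimal sketch of that idea, assuming the values live in hidden input fields. It uses the standard library's html.parser rather than BeautifulSoup so it runs with no extra packages; HiddenFieldParser, extract_hidden_fields, and the sample HTML are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up sample standing in for the page returned by the first GET request.
sample_html = """
<form method="post" action="./Default.aspx?type=TPD">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="submit" name="ctl00$MainContent$btndata" value="Export to Excel" />
</form>
"""

form_data = extract_hidden_fields(sample_html)
# The submit button is not a hidden field, so it is added by hand.
form_data["ctl00$MainContent$btndata"] = "Export to Excel"
print(form_data)
```

The resulting dictionary could then be passed as data= to requests.post, in place of the hand-written formData.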
Find out the URL you need to fetch, as @Sphinx explains, and then simulate it using something similar to:
import urllib.request
import urllib.parse
data = urllib.parse.urlencode({...})
data = data.encode('ascii')
with urllib.request.urlopen("http://...", data) as fd:
    print(fd.read().decode('utf-8'))
Take a look at the urllib documentation.

Python Requests declined: "Due to the presence of characters known to be used in Cross Site Scripting attacks, access is forbidden."

Dear fellow requests users,
Update:
Sorry, guys. My error came from a mistake:
My goal was to do this:
r = requests.get('http://www.spdrs.com/product/fund.seam?ticker=SPY', stream=True, headers=hdr)
I did this:
r = requests.get('http://www.spdrs.com/product/fund.seam?ticker={}'.format(['SPY']), stream=True, headers=hdr)
Which should be:
r = requests.get('http://www.spdrs.com/product/fund.seam?ticker={}'.format('SPY'), stream=True, headers=hdr)
The extra brackets [] were causing the error, apparently. Dumb mistake. Feel free to vote me down, if you wish.
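For anyone hitting the same message, this short sketch shows what the extra brackets do to the URL. It only builds strings and sends nothing; the variable names are made up, and urlencode is shown as one possible way to build the query that is harder to get wrong.

```python
from urllib.parse import urlencode

base = "http://www.spdrs.com/product/fund.seam?ticker={}"

# Formatting a list inserts its repr, brackets and quotes included.
wrong = base.format(["SPY"])
# Formatting the plain string gives the intended URL.
right = base.format("SPY")

# Letting urlencode build the query string from a dictionary.
safer = "http://www.spdrs.com/product/fund.seam?" + urlencode({"ticker": "SPY"})

print(wrong)
print(right)
print(safer)
```

The brackets and quotes in the first URL are presumably the characters the server's cross-site scripting filter objects to.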
Original question:
I am trying to scrape spdrs.com webpage using:
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
r = requests.get('http://www.spdrs.com/product/fund.seam?ticker=SPY', stream=True, headers=hdr)
But all I get is this:
Due to the presence of characters known to be used in Cross Site Scripting attacks, access is forbidden.
It's the same with http or https.
If I remove hdr, I get a straight 403 decline.
Is there any modification I can do to the hdr to show the website that I am a well-behaving script? I know, servers don't like scrapers.
This thread on SO shows a similar problem from webmaster's perspective.
Thanks a lot!
Yi
Drop the following key/value from your header:
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
It works for me after I did that.
