I'm trying to get a POST request, but I don't know what's wrong with my code that the data doesn't come.
The following message is displayed:
HTTP status code is not handled or not allowed
This is the website
A screenshot of the header:
This is my code:
import json
import scrapy
class MySpider(scrapy.Spider):
name = 'pb'
payload = {"version":"1.0.0","queries":[{"Query":{"Commands":[{"SemanticQueryDataShapeCommand":{"Query":{"Version":2,"From":[{"Name":"e","Entity":"Events"},{"Name":"d","Entity":"DAX"}],"Select":[{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Date Start"},"Name":"Events.Date Start"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Event Type"},"Name":"Events.Event Type"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Name"},"Name":"Events.Name"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Length"},"Name":"Events.Total Days"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Location"},"Name":"Events.Location"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Link to Event"},"Name":"Events.Link to Event"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Days Until Event"},"Name":"DAX.Days Until"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Link to Submit"},"Name":"Events.Link to Submit"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Event Type Number"},"Name":"DAX.Event Type Number"}],"OrderBy":[{"Direction":1,"Expression":{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Date Start"}}}]},"Binding":{"Primary":{"Groupings":[{"Projections":[0,1,2,3,4,5,6,7,8]}]},"DataReduction":{"DataVolume":3,"Primary":{"Window":{"Count":500}}},"Aggregates":[{"Select":3,"Aggregations":[{"Min":{}},{"Max":{}}]}],"SuppressedJoinPredicates":[8],"Version":1}}}]},"CacheKey":"{\"Commands\":[{\"SemanticQueryDataShapeCommand\":{\"Query\":{\"Version\":2,\"From\":[{\"Name\":\"e\",\"Entity\":\"Events\"},{\"Name\":\"d\",\"Entity\":\"DAX\"}],\"Select\":[{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Date Start\"},\"Name\":\"Events.Date Start\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Event Type\"},\"Name\":\"Events.Event Type\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Name\"},\"Name\":\"Events.Name\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Length\"},\"Name\":\"Events.Total Days\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Location\"},\"Name\":\"Events.Location\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Link to Event\"},\"Name\":\"Events.Link to Event\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Days Until Event\"},\"Name\":\"DAX.Days Until\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Link to Submit\"},\"Name\":\"Events.Link to Submit\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Event Type Number\"},\"Name\":\"DAX.Event Type Number\"}],\"OrderBy\":[{\"Direction\":1,\"Expression\":{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Date Start\"}}}]},\"Binding\":{\"Primary\":{\"Groupings\":[{\"Projections\":[0,1,2,3,4,5,6,7,8]}]},\"DataReduction\":{\"DataVolume\":3,\"Primary\":{\"Window\":{\"Count\":500}}},\"Aggregates\":[{\"Select\":3,\"Aggregations\":[{\"Min\":{}},{\"Max\":{}}]}],\"SuppressedJoinPredicates\":[8],\"Version\":1}}}]}","QueryId":"","ApplicationContext":{"DatasetId":"6427f3c6-42f6-4287-b061-c31c1d2e7ae0","Sources":[{"ReportId":"6e442642-8594-4894-bc32-0ab7f4620772"}]}}],"cancelQueries":[],"modelId":1226835}
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
def start_requests(self):
yield scrapy.Request(
url='https://wabi-australia-southeast-api.analysis.windows.net/public/reports/querydata?synchronous=true',
method='POST',
body=json.dumps(self.payload),
headers={
'Accept-Language': 'pt-BR,pt;q=0.9,en;q=0.8',
'ActivityId': '1d3ecdc2-5dc0-801e-4140-82a258f127a6',
'Connection': 'keep-alive',
'Content-Length': '3462',
'Content-Type': 'application/json;charset=UTF-8',
'Host': 'wabi-australia-southeast-api.analysis.windows.net',
'Origin': 'https://app.powerbi.com',
'Referer': 'https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9',
'RequestId': '11c18fe6-00da-7df4-952c-98ba7bdf188e',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'cross-site',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
'X-PowerBI-ResourceKey': '0b056628-32ac-4310-9900-a260ee395636'
}
)
def parse(self, response):
items = json.loads(response.text)
yield {"data":items}
The request in your screenshot is a GET request.
The behaviour of this website is very interesting!
Let's examine it.
By looking at the network panel we can see that GET request is being made to some complex url with many various headers. However It seems that the header X-PowerBI-ResourceKey is the only one that's needed and it controls what content the request will return.
So all we need to replicate this is find the X-PowerBI-ResourceKey value.
If you take a look at the source code of the html page:
https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9
Here we can see that javascript's atob method is used on url parameter. This is javascripts b64decode function. We can run it in python:
$ ptpython
>>> from base64 import b64decode
>>> b64decode("eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzF
1 kMGQ2ZDE5ZSJ9")
b'{"k":"0b056628-32ac-4310-9900-a260ee395636","t":"6f0e9c42-96ce-4551-9701-ba31d0d6d19e"}'
We got it figured out! Now lets put everything together in our crawler:
import json
from base64 import b64decode
from w3lib.url import url_query_parameter
def parse(self, response):
url = "https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9"
# get the "r" paremeter from url
resource_key = url_query_parameter(url, 'r')
# base64 decode it
resource_key = b64decode(resource_key)
# {'k': '0b056628-32ac-4310-9900-a260ee395636', 't': '6f0e9c42-96ce-4551-9701-ba31d0d6d19e'}
# it's a json string - load it and get key "k"
resource_key = json.loads(resource_key)['k']
headers = {
'Accept': "application/json, text/plain, */*",
# 'X-PowerBI-ResourceKey': "0b056628-32ac-4310-9900-a260ee395636",
'X-PowerBI-ResourceKey': resource_key,
'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36",
'Accept-Encoding': "gzip, deflate, br",
'Accept-Language': "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
yield Request(url, headers=headers)
Related
I'm trying to log in to a website through a python script that I've created using the requests module. I've issued a post HTTP request with appropriate parameters and headers to the server, but for some reason I get a different response from that site compared to what I see in dev tools. The status is always 200, though. There is also a get request in place within the script that should fetch the credentials once the login is successful. Currently, it throws a JSONDecodeError on the last line.
import requests
link = 'https://propwire.com/login'
check_url = 'https://propwire.com/search'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'x-requested-with': 'XMLHttpRequest',
'referer': 'https://propwire.com/login',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
'origin': 'https://propwire.com',
}
payload = {"email":"some-email","password":"password","remember":"true"}
with requests.Session() as s:
r = s.get(link)
headers['x-xsrf-token'] = r.cookies['XSRF-TOKEN'].rstrip('%3D')
s.headers.update(headers)
s.post(link,json=payload)
res = s.get(check_url)
print(res.json()['props']['auth'])
Here's an example link I'm trying to scrape: https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227
My scraper was working fine till today so I'm guessing Lowe's added more protection against bots :(
After some research, I found that I would have to add headers to my web scraper so I can emulate a real user.
Opened up Dev Console -> Network -> XHR/Fetch -> Found JSON File.
Here's my scrapy script
# -*- coding: utf-8 -*-
import scrapy
from ..items import LowesItem
import re
import pandas as pd
import requests
import json
from scrapy.http import Request
from datetime import date
class LowesSpider(scrapy.Spider):
name = 'Lowes'
def start_requests(self):
HEADERS = {
'method': 'GET',
'scheme': 'https',
'authority': 'content.syndigo.com',
'Accept': '*/*',
'Content-Type': 'text/plain',
'Origin': 'https://lowes.com',
'Accept-Language': 'en-US,en;q=0.9',
'Host': 'content.syndigo.com',
'User-Agent': ' Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15',
'Referer': 'https://www.lowes.com/',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Cookie': 'sn=0321'
}
start_urls = ['https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227']
for url in start_urls:
yield Request(url,
headers=HEADERS,
meta={'dont_merge_cookies': True,
'url':url})
def parse(self, response):
for item in self.parseLowes(response):
yield item
pass
def parseLowes(self, response):
item = LowesItem() #items from items.py
script_tag = response.xpath('//script[#type="application/ld+json"]/text()').get() #get js container
productPrice = json.loads(script_tag)[2]["offers"]["price"]
productURL = response.url
url = response.meta['url']
productSKU = url.split("=")[-1]
scrapedDate = date.today()
#item['productName'] = productName #display product name
item['productOMS'] = productSKU
item['productPrice'] = productPrice #display price and assign to variable
item['productURL'] = productURL #displayURL
item['scrapedDate'] = scrapedDate
yield item
When I run scrapy, I get 400 as a response from the command.
From what I can see about the network connection, the issue is related to the their CDN (Akamai) which is blocking the access.
I was able to access your link and see the product from Microsoft Edge (version 107). In my request the user agent is:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.26
So, try to modify in your code the 'User-Agent' key with that value.
I'm trying to create a script using scrapy to grab json content from this webpage. I've used headers within the script accordingly but when I run it, I always end up getting JSONDecodeError. The site sometimes throws captcha but not always. However, I've never got any success using the script below even when I used vpn. How can I fix it?
This is how I've tried:
import scrapy
import urllib
class ImmobilienScoutSpider(scrapy.Spider):
name = "immobilienscout"
start_url = "https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen"
headers = {
'accept': 'application/json; charset=utf-8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'x-requested-with': 'XMLHttpRequest',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
params = {
'price': '1000.0-',
'constructionyear': '-2000',
'pagenumber': '1'
}
def start_requests(self):
req_url = f'{self.start_url}?{urllib.parse.urlencode(self.params)}'
yield scrapy.Request(
url=req_url,
headers=self.headers,
callback=self.parse,
)
def parse(self,response):
yield {"response":response.json()}
This is how the output should look like (truncated):
{"searchResponseModel":{"additional":{"lastSearchApiUrl":"/region?realestatetype=apartmentbuy&price=1000.0-&constructionyear=-2000&pagesize=20&geocodes=1276010&pagenumber=1","title":"Eigentumswohnung in Nordrhein-Westfalen - ImmoScout24","sortingOptions":[{"description":"Standardsortierung","code":0},{"description":"Kaufpreis (höchste zuerst)","code":3},{"description":"Kaufpreis (niedrigste zuerst)","code":4},{"description":"Zimmeranzahl (höchste zuerst)","code":5},{"description":"Zimmeranzahl (niedrigste zuerst)","code":6},{"description":"Wohnfläche (größte zuerst)","code":7},{"description":"Wohnfläche (kleinste zuerst)","code":8},{"description":"Neubau-Projekte (Projekte zuerst)","code":31},{"description":"Aktualität (neueste zuerst)","code":2}],"pagerTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=%page%","sortingTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&sorting=%sorting%","world":"LIVING","international":false,"device":{"deviceType":"NORMAL","devicePlatform":"UNKNOWN","tablet":false,"mobile":false,"normal":true}
EDIT:
This is how the script built upon requests module looks like:
import requests
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen'
headers = {
'accept': 'application/json; charset=utf-8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'x-requested-with': 'XMLHttpRequest',
'content-type': 'application/json; charset=utf-8',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=1',
# 'cookie': 'hardcoded cookies'
}
params = {
'price': '1000.0-',
'constructionyear': '-2000',
'pagenumber': '2'
}
sess = requests.Session()
sess.headers.update(headers)
resp = sess.get(link,params=params)
print(resp.json())
Scrapy's CookiesMiddleware disregards 'cookie' passed in headers.
Reference: scrapy/scrapy#1992
Pass cookies explicitly:
yield scrapy.Request(
url=req_url,
headers=self.headers,
callback=self.parse,
# Add the following line:
cookies={k: v.value for k, v in http.cookies.SimpleCookie(self.headers.get('cookie', '')).items()},
),
Note: That site uses GeeTest CAPTCHA, which cannot be solved by simply rendering the page or using Selenium, so you still need to periodically update the hardcoded cookie (cookie name: reese84) taken from the browser, or use a service like 2Captcha.
The response web page is as below when to slect title and input wordpress.
Here is my python code to pass arguments for get method with python3.
import urllib.request
import urllib.parse
url = 'http://www.it-ebooks.info/'
values = {'q': 'wordpress','type': 'title'}
data = urllib.parse.urlencode(values).encode(encoding='utf-8',errors='ignore')
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0' }
request = urllib.request.Request(url=url, data=data,headers=headers,method='GET')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
I can't get the desired output web page.
How to pass arguments for get method with urllib in my example?
The data kwarg of urllib.request.Request is only used for POST requests as it modifies the request's body.
GET requests simply use URL parameters, so you should append these to the url:
params = '?q=wordpress&type=title'
url = 'http://www.it-ebooks.info/search/{}'.format(params)
You can of course take the time and generalize this into a generic function.
is better if you use the library called requests
import requests
headers = {
'DNT': '1',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'es-ES,es;q=0.8,en;q=0.6',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'http://www.it-ebooks.info/',
'Connection': 'keep-alive',
}
r = requests.get('http://www.it-ebooks.info/search/?q=wordpress&type=title', headers=headers)
print r.content
I change the default request headers in settings.py as below:
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
}
However, it doesn't work in my HotSpider. I can see scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware was enabled, but connection was closed cleanly as if the headers were not set.
Here is the HotSpider:
# -*- coding: utf-8 -*-
import scrapy
class HotSpider(scrapy.Spider):
name = "hot"
allowed_domains = ["qiushibaike.com"]
start_urls = (
'http://www.qiushibaike.com/hot',
)
def parse(self, response):
print '\n', response.status, '\n'
If I change the code to override the make_requests_from_url to set the header, everything works well.
# -*- coding: utf-8 -*-
import scrapy
class HotSpider(scrapy.Spider):
name = "hot"
allowed_domains = ["qiushibaike.com"]
start_urls = (
'http://www.qiushibaike.com/hot',
)
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
}
def make_requests_from_url(self, url):
return scrapy.http.Request(url, headers=self.headers)
def parse(self, response):
print '\n', response.status, '\n'
This problem will be settled in Scrapy 1.2 according to prioritize default headers over user agent middlewares #2091
I see User-Agent header is indeed not set properly when using default headers middleware and this particular site refuses connections without some expected user-agent header.
Recommended way to set user-agent for your crawler is by using USER_AGENT setting key:
e.g.
# settings.py
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"
not setting user-agent when using default headers might be some bug in Scrapy, or maybe this is expected and documented somewhere. You need to do more research about this, if it is indeed bug it's worth posting bug report in Scrapy github repo.