302 Redirect in XHR request - Python

I need to send a POST request to this URL:
http://lastsecond.ir/hotels/ajax
You can see the other parameters sent by this request here:
Form data:
filter_score:
sort:reviewed_at
duration:0
page:1
base_location_id:1
Request headers:
:authority:lastsecond.ir
:method:POST
:path:/hotels/ajax
:scheme:https
accept:*/*
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.9,fa;q=0.8,ja;q=0.7
content-length:67
content-type:application/x-www-form-urlencoded; charset=UTF-8
cookie:_jsuid=2453861291; read_announcements=,11,11; _ga=GA1.2.2083988810.1511607903; _gid=GA1.2.1166842676.1513922852; XSRF-TOKEN=eyJpdiI6IlZ2TklPcnFWU3AzMlVVa0k3a2xcL2dnPT0iLCJ2YWx1ZSI6ImVjVmt2c05STWRTUnJod1IwKzRPNk4wS2lST0k1UTk2czZwZXJxT2FQNmppNkdUSFdPK29kU29RVHlXbm1McTlFSlM5VlIwbGNhVUozbXFBbld5c2tRPT0iLCJtYWMiOiI4YmNiMGQwMzdlZDgyZTE2YWNlMWY1YjdmMzViNDQwMmRjZGE4YjFmMmM1ZmUyNTQ0NmE1MGRjODFiNjMwMzMwIn0%3D; lastsecond-session=eyJpdiI6ImNZQjdSaHhQM1lZaFJIZzhJMWJXN0E9PSIsInZhbHVlIjoiK1NWdHJiUTdZQzBYeEsyUjE3QXFhUGJrQXBGcExDMVBXTjhpSVJLRlFnUjVqXC9USHBxNGVEZ3dwKzVGcG5yeU93VTZncG9wRGpvK0VpVnQ2b1ByVnh3PT0iLCJtYWMiOiI4NTFkYmQxZTFlMTMxOWFmZmU1ZjA1ZGZhNTMwNDFmZmU0N2FjMGVjZTg1OGU2NGE0YTNmMTc2MDA5NWM1Njg3In0%3D
origin:https://lastsecond.ir
referer:https://lastsecond.ir/hotels?score=&page=1&sort=reviewed_at&duration=0
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36
x-csrf-token:oMpQTG0wN0YveJIk2WhkesvzjZE2FqHkDqPiW8Dy
x-requested-with:XMLHttpRequest
The result of this request is supposed to be JSON, but instead it redirects the request to its parent URL. I'm using Scrapy with Python to send this request; here is the Scrapy code:
import scrapy
from scrapy.http import FormRequest


class HotelsSpider(scrapy.Spider):
    name = 'hotels'
    allowed_domains = ['lastsecond.ir']
    start_urls = ['http://lastsecond.ir/hotels']

    def parse(self, response):
        data = {
            'filter_score': '',
            'sort': 'reviewed_at',
            'duration': '0',
            'page': '1',
            'base_location_id': '1'
        }
        headers = {
            'user-agent': 'Mozilla/5.0',
            'x-csrf-token': 'oMpQTG0wN0YveJIk2WhkesvzjZE2FqHkDqPiW8Dy',
            'x-requested-with': 'XMLHttpRequest'
        }
        url = 'https://lastsecond.ir/hotels/ajax'
        return FormRequest(
            url=url,
            callback=self.parse_details,
            formdata=data,
            method="POST",
            headers=headers,
            dont_filter=True
        )

    def parse_details(self, response):
        data = response.body_as_unicode()
        print(data)
        #f = open('output.json', 'w')
        #f.write(data)
        #f.close()
I've changed my code so it fetches a fresh CSRF token every time it sends a request:
class HotelsSpider(scrapy.Spider):
    name = 'hotels'
    allowed_domains = ['lastsecond.ir']
    start_urls = ['http://lastsecond.ir/hotels']

    def parse(self, response):
        html = response.body_as_unicode()
        start = html.find("var csrftoken = '")
        start = start + len("var csrftoken = '")
        end = html.find("';", start)
        self.csrftoken = html[start:end]
        print('csrftoken:', self.csrftoken)
        yield self.ajax_request('1')

    def ajax_request(self, page):
        data = {
            'filter_score': '',
            'sort': 'reviewed_at',
            'duration': '0',
            'page': page,
            'base_location_id': '1'
        }
        headers = {
            'user-agent': 'Mozilla/5.0',
            'x-csrf-token': self.csrftoken,
            'x-requested-with': 'XMLHttpRequest'
        }
        url = 'https://lastsecond.ir/hotels/ajax'
        return FormRequest(
            url=url,
            callback=self.parse_details,
            formdata=data,
            method="POST",
            headers=headers,
            dont_filter=True
        )

    def parse_details(self, response):
        print(response.body_as_unicode())
Any help would be appreciated.

Your mistake is using the same 'x-csrf-token' in every request.
The 'x-csrf-token' is a mechanism for blocking bots and scripts.
Wikipedia: Cross-Site Request Forgery
Every time you open the page in a browser, the portal generates a new, unique 'x-csrf-token' which is only valid for a short time. You can't keep using the same 'x-csrf-token' forever.
As in the answer to the previous question, I make a GET request to fetch the page and extract a fresh X-CSRF-TOKEN.
See self.csrftoken in the code:
def parse(self, response):
    print('url:', response.url)

    html = response.body_as_unicode()
    start = html.find("var csrftoken = '")
    start = start + len("var csrftoken = '")
    end = html.find("';", start)
    self.csrftoken = html[start:end]
    print('csrftoken:', self.csrftoken)

    yield self.create_ajax_request('1')
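As an aside, the same token could also be pulled with a regular expression instead of find(), which makes the "token not found" case explicit (a small sketch, using the html variable from parse above):
import re

match = re.search(r"var csrftoken = '([^']+)'", html)
self.csrftoken = match.group(1) if match else None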
And later I use this token in the AJAX requests.
def create_ajax_request(self, page):
    '''
    This helper can't use `yield`; it has to `return` the Request to the
    parser, and the parser can then `yield` it.
    '''
    print('yield page:', page)

    url = 'https://lastsecond.ir/hotels/ajax'

    headers = {
        'X-CSRF-TOKEN': self.csrftoken,
        'X-Requested-With': 'XMLHttpRequest',
    }

    params = {
        'filter_score': '',
        'sort': 'reviewed_at',
        'duration': '0',
        'page': page,
        'base_location_id': '1',
    }

    return scrapy.FormRequest(url,
        callback=self.parse_details,
        formdata=params,
        headers=headers,
        dont_filter=True,
    )
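If the request succeeds, the endpoint should answer with JSON rather than a redirect. A minimal sketch of a parse_details that checks for that before doing anything else (the actual payload structure is unknown, so field handling is left open):
import json

def parse_details(self, response):
    try:
        data = json.loads(response.text)
    except ValueError:
        # Most likely redirected back to the HTML page: stale token or missing headers.
        self.logger.warning('Expected JSON, got something else from %s', response.url)
        return
    # Inspect the top-level structure before deciding how to extract hotels or paginate.
    self.logger.info('Top-level JSON type: %s', type(data).__name__)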

Are you making a malformed request? The easiest way to find out is to copy the request from the browser as cURL (F12 -> Network -> right-click the specific request -> Copy -> Copy as cURL) and convert it to Python with a curl-to-Python converter tool (without Scrapy).
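For reference, a minimal sketch of what such a converted request might look like with the plain requests library, assuming the same form fields as above; the CSRF token below is a placeholder and would have to come from a fresh GET of the page:
import requests

session = requests.Session()

headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-CSRF-TOKEN': 'PASTE_A_FRESH_TOKEN_HERE',  # placeholder: read it from a fresh GET of /hotels
    'X-Requested-With': 'XMLHttpRequest',
}
data = {
    'filter_score': '',
    'sort': 'reviewed_at',
    'duration': '0',
    'page': '1',
    'base_location_id': '1',
}
resp = session.post('https://lastsecond.ir/hotels/ajax', data=data, headers=headers)
print(resp.status_code, resp.headers.get('content-type'))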

Related

Scraping with AJAX - how to obtain the data?

I am trying to scrape the data from https://www.anre.ro/ro/info-consumatori/comparator-oferte-tip-de-furnizare-a-gn, which loads its data via AJAX (the request URL is https://www.anre.ro/ro/ajax/comparator/get_results_gaz).
However, I can see that the Form Data takes the form tip_client=casnic&modalitate_racordare=sistem_de_distributie&transee_de_consum=b1&tip_pret_unitar=cu_reglementate&id_judet=ALBA&id_siruta=1222&consum_mwh=&pret_furnizare_mwh=&componenta_fixa=&suplimentar_componenta_fixa=&termen_plata=&durata_contractului=&garantii=&frecventa_emitere_factura=&tip_pret= (when I view source in Chrome). How do I pass this to Scrapy or any other module to retrieve the desired page?
So far, I have this (is the JSON format correct given the Form Data?):
import scrapy


class ExSpider(scrapy.Spider):
    name = 'ExSpider'
    allowed_domains = ['anre.ro']

    def start_requests(self):
        params = {
            "tip_client": "casnic",
            "modalitate_racordare": "sistem_de_distributie",
            "transee_de_consum": "b1",
            "tip_pret_unitar": "cu_reglementate",
            "id_judet": "ALBA",
            "id_siruta": "1222",
            "consum_mwh": "",
            "pret_furnizare_mwh": "",
            "componenta_fixa": "",
            "suplimentar_componenta_fixa": "",
            "termen_plata": "",
            "durata_contractului": "",
            "garantii": "",
            "frecventa_emitere_factura": "",
            "tip_pret": ""
        }
        r = scrapy.FormRequest('https://www.anre.ro/ro/ajax/comparator/get_results_gaz', method="POST", formdata=params)
        print(r)
The following should produce the required response from the page you wish to grab data from.
import scrapy


class ExSpider(scrapy.Spider):
    name = "exspider"

    url = 'https://www.anre.ro/ro/ajax/comparator/get_results_gaz'
    payload = {
        'tip_client': 'casnic',
        'modalitate_racordare': 'sistem_de_distributie',
        'transee_de_consum': 'b2',
        'tip_pret_unitar': 'cu_reglementate',
        'id_judet': 'ALBA',
        'id_siruta': '1222',
        'consum_mwh': '',
        'pret_furnizare_mwh': '',
        'componenta_fixa': '',
        'suplimentar_componenta_fixa': '',
        'termen_plata': '',
        'durata_contractului': '',
        'garantii': '',
        'frecventa_emitere_factura': '',
        'tip_pret': ''
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://www.anre.ro/ro/info-consumatori/comparator-oferte-tip-de-furnizare-a-gn'
    }

    def start_requests(self):
        yield scrapy.FormRequest(
            self.url,
            formdata=self.payload,
            headers=self.headers,
            callback=self.parse
        )

    def parse(self, response):
        print(response.text)
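As a side note, the payload dict does not have to be typed out by hand; it can be built straight from the urlencoded string shown in the browser with urllib.parse.parse_qsl (a small sketch):
from urllib.parse import parse_qsl

raw = ("tip_client=casnic&modalitate_racordare=sistem_de_distributie&transee_de_consum=b1"
       "&tip_pret_unitar=cu_reglementate&id_judet=ALBA&id_siruta=1222&consum_mwh="
       "&pret_furnizare_mwh=&componenta_fixa=&suplimentar_componenta_fixa=&termen_plata="
       "&durata_contractului=&garantii=&frecventa_emitere_factura=&tip_pret=")
# keep_blank_values=True keeps the empty fields the endpoint expects
payload = dict(parse_qsl(raw, keep_blank_values=True))
# payload can now be passed as formdata= to scrapy.FormRequest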

How to get a correct session_id? (Scrapy, Python)

There is a URL: https://maps.leicester.gov.uk/map/Aurora.svc/run?inspect_query=QPPRN&inspect_value=ROH9385&script=%5CAurora%5Cw3%5CPLANNING%5Cw3PlanApp_MG.AuroraScript%24&nocache=f73eee56-45da-f708-87e7-42e82982370f&resize=always
It returns the coordinates. To get the coordinates, it makes three requests (I suppose):
1. the URL mentioned above,
2. requesting a session_id,
3. getting the coordinates using the session_id from step 2.
I am getting the session_id in step 2, but it is wrong: I can't get the coordinates in step 3 using it. How do I know the problem is the session_id? When I insert the session_id taken from the browser, my code works fine and the coordinates are received.
(Screenshots of the requests in the browser, the correct response from the browser, and the response my code gets are omitted here.)
Here is my code (it is for the Scrapy framework):
import re
import time

import scrapy
import inline_requests


@inline_requests.inline_requests
def get_map_data(self, response):
    """ Getting map data. """
    map_referer = ("https://maps.leicester.gov.uk/map/Aurora.svc/run?inspect_query=QPPRN&"
                   "inspect_value=ROH9385&script=%5CAurora%5Cw3%5CPLANNING%5Cw3PlanApp_MG.AuroraScript"
                   "%24&nocache=f73eee56-45da-f708-87e7-42e82982370f&resize=always")
    response = yield scrapy.Request(
        url=map_referer,
        meta=response.meta,
        method='GET',
        dont_filter=True,
    )

    time_str = str(int(time.time() * 1000))
    headers = {
        'Referer': response.url,
        'Accept': 'application/javascript, */*; q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
        'Host': 'maps.leicester.gov.uk',
        'Sec-Fetch-Dest': 'script',
        'Sec-Fetch-Mode': 'no-cors',
        'Sec-Fetch-Site': 'same-origin',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
    }
    response.meta['handle_httpstatus_all'] = True
    url = ('https://maps.leicester.gov.uk/map/Aurora.svc/RequestSession?userName=inguest'
           '&password=&script=%5CAurora%5Cw3%5CPLANNING%5Cw3PlanApp_MG.AuroraScript%24&'
           f'callback=_jqjsp&_{time_str}=')
    reqest_session_response = yield scrapy.Request(
        url=url,
        meta=response.meta,
        method='GET',
        headers=headers,
        dont_filter=True,
    )

    session_id = re.search(r'"SessionId":"([^"]+)', reqest_session_response.text)
    session_id = session_id.group(1) if session_id else None
    print(8888888888888)
    print(session_id)
    # session_id = '954f04e2-e52c-4dd9-9046-f3f013d3f633'

    # pprn = item.get('other', {}).get('PPRN')
    pprn = 'ROH9385'  # hard coded for the current page

    if session_id and pprn:
        time_str = str(int(time.time() * 1000))
        url = ('https://maps.leicester.gov.uk/map/Aurora.svc/FindValue'
               f'Location?sessionId={session_id}&value={pprn}&query=QPPRN&callback=_jqjsp'
               f'&_{time_str}=')
        coords_response = yield scrapy.Request(
            url=url,
            method='GET',
            meta=reqest_session_response.meta,
            dont_filter=True,
        )
        print(coords_response.text)
        breakpoint()
Could you please correct my code so that it gets the coordinates?
The website creates a sessionId first, then uses that sessionId to create a map layer on the server (I guess). Only then can you start requesting; otherwise it can't find the map layer under that sessionId.
import requests

# 1. Request a session id.
url = "https://maps.leicester.gov.uk/map/Aurora.svc/RequestSession?userName=inguest&password=&script=%5CAurora%5Cw3%5CPLANNING%5Cw3PlanApp_MG.AuroraScript%24"
res = requests.get(url, verify=False).json()
sid = res["Session"]["SessionId"]

# 2. Open the script map for that session (this creates the layer on the server).
url = f"https://maps.leicester.gov.uk/map/Aurora.svc/OpenScriptMap?sessionId={sid}"
res = requests.get(url, verify=False)

# 3. Now the value lookup works.
url = f"https://maps.leicester.gov.uk/map/Aurora.svc/FindValueLocation?sessionId={sid}&value=ROH9385&query=QPPRN"
res = requests.get(url, verify=False).json()
print(res)
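One practical note on the snippet above: verify=False makes urllib3 emit an InsecureRequestWarning on every call. If certificate verification really has to stay off, the warning can be silenced like this:
import urllib3

# Only do this if you have accepted the risk of skipping certificate verification.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)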

Logging in to a tricky site with Python

I am trying to scrape the server list from https://www.astrill.com/member-zone/tools/vpn-servers, which is for members only. Username, password and a captcha are required. Everything works if I log in with a browser and copy the 'PHPSESSID' cookie, but I want to log in with Python. I download the captcha and enter it manually, but I am still not able to log in. Login URL: https://www.astrill.com/member-zone/log-in
Could anybody help me, please?
import os
from urllib.request import urlretrieve

import bs4
import requests

# BASE_FOLDER is defined elsewhere in my script.
SERVERS_URL = 'https://www.astrill.com/member-zone/tools/vpn-servers'
LOGIN_URL = 'https://www.astrill.com/member-zone/log-in'


def get_capcha(url):
    print(f'Scraping url: {url}')
    try:
        response = requests.get(url)
        response.raise_for_status()
    except Exception as e:
        print(type(e), e)
    if response.status_code == 200:
        print('Success!')
        page = response.content
        soup = bs4.BeautifulSoup(page, 'html.parser')
        captcha_url = soup.find('img', alt='captcha')['src']
        captcha_file = os.path.join(BASE_FOLDER, 'captcha.jpg')
        id = soup.find(id='csrf_token')
        print(id['value'])
        print(f'Captcha: {captcha_url}')
        print(response.headers)
        urlretrieve(captcha_url, captcha_file)
        return id['value']


def login(url, id):
    captcha_text = input('Captcha: ')
    print(id)
    payload = {
        'action': 'log-in',
        'username': 'myusername@a.com',
        'password': '1111111',
        'captcha': captcha_text,
        '_random': 'l4r1b7hf4g',
        'csrf_token': id
    }
    session = requests.session()
    post = session.post(url, data=payload)
    r = session.get(SERVERS_URL)
    print(r.text)
    print(r.cookies)


if __name__ == '__main__':
    id = get_capcha(LOGIN_URL)
    login(LOGIN_URL, id)
First of all, I was not sure about the payload fields to POST. They can easily be discovered with Firefox Developer Tools - Network, where you can see what your browser actually posts. The second thing I discovered was that I need to request the captcha file within the same session, with my headers and cookies. So my code now looks like the following, and it works! (Some header fields can probably be removed.)
import os

import bs4
import requests

# BASE_FOLDER is defined elsewhere in my script; the URLs are the same as above.
SERVERS_URL = 'https://www.astrill.com/member-zone/tools/vpn-servers'
LOGIN_URL = 'https://www.astrill.com/member-zone/log-in'

session = requests.session()

cookies = {}
headers = {
    'Host': 'www.astrill.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Length': '169',
    'Origin': 'https://www.astrill.com',
    'Connection': 'keep-alive',
    'Referer': 'https://www.astrill.com/member-zone/log-in',
}
payload = {
    'action': 'log-in',
    'username': 'myusername@a.com',
    'password': '1111111',
    'remember_me': 0,
    'captcha': '',
    '_random': 'somerandom1',
    'csrf_token': ''
}


def get_capcha(url):
    print(f'Scraping url: {url}')
    try:
        response = session.get(url)
        response.raise_for_status()
    except Exception as e:
        print(type(e), e)
    if response.status_code == 200:
        print('Success!')
        page = response.content
        soup = bs4.BeautifulSoup(page, 'html.parser')
        captcha_url = soup.find('img', alt='captcha')['src']
        captcha_file = os.path.join(BASE_FOLDER, 'captcha.jpg')
        payload['csrf_token'] = soup.find(id='csrf_token')['value']
        print(f'csrf_token: {payload["csrf_token"]}')
        print(f'Captcha: {captcha_url}')
        cookies.update(response.cookies)
        captcha_img = session.get(captcha_url, headers=headers, cookies=cookies)
        file = open(captcha_file, "wb")
        file.write(captcha_img.content)
        file.close()
        payload['captcha'] = input('Captcha: ')
        return


def login(url):
    post = session.post(url, data=payload, headers=headers, cookies=cookies)
    print(post.text)
    r = session.get(SERVERS_URL, cookies=cookies)
    print(r.text)
    print(r.cookies)


def main():
    get_capcha(LOGIN_URL)
    login(LOGIN_URL)


if __name__ == '__main__':
    main()

SCRAPY: Every time my spider crawls, it is scraping the same page (first page)

I have written code to scrape a page using Scrapy in Python. Below I have pasted the main.py code. Whenever I run my spider, it scrapes only the first page (DEBUG: Scraped from <200 https://www.tuscc.si/produkti/instant-juhe>), which is also the Referer in the Request Headers (when inspected).
I have tried adding the content of the "Request Payload" field, which is: {"action":"loadList","skip":64,"filter":{"1005":[],"1006":[],"1007":[],"1009":[],"1013":[]}}, and when I open the page with it appended like this:
https://www.tuscc.si/produkti/instant-juhe#32;'action':'loadList';'skip':'32';'sort':'none'
the browser opens it, but the Scrapy shell doesn't. I have also tried adding the number from the Request URL https://www.tuscc.si/cache/script/tuscc.js?1563872492384, where the query string parameter is 1563872492384, but it still won't scrape the requested page.
I have also tried many variations and added many things I read online, just to see if there would be progress, but nothing worked.
The code is:
from scrapy.spiders import CrawlSpider
from tus_pomos.items import TusPomosItem
from tus_pomos.scrapy_splash import SplashRequest


class TusPomosSpider(CrawlSpider):
    name = 'TUSP'
    allowed_domains = ['www.tuscc.si']
    start_urls = ["https://www.tuscc.si/produkti/instant-juhe#0;1563872492384;",
                  "https://www.tuscc.si/produkti/instant-juhe#64;1563872492384;", ]
    download_delay = 5.0

    def start_requests(self):
        # payload = [
        #     {"action": "loadList",
        #      "skip": 0,
        #      "filter": {
        #          "1005": [],
        #          "1006": [],
        #          "1007": [],
        #          "1009": [],
        #          "1013": []}
        #      }]
        for url in self.start_urls:
            r = SplashRequest(url, self.parse, magic_response=False, dont_filter=True, endpoint='render.json',
                              meta={
                                  'original_url': url,
                                  'dont_redirect': True},
                              args={
                                  'wait': 2,
                                  'html': 1
                              })
            r.meta['dont_redirect'] = True
            yield r

    def parse(self, response):
        items = TusPomosItem()
        pro = response.css(".thumb-box")
        for p in pro:
            pro_link = p.css("a::attr(href)").extract_first()
            pro_name = p.css(".description::text").extract_first()
            items['pro_link'] = pro_link
            items['pro_name'] = pro_name
            yield items
In short, I want to crawl all the pages of the pagination, for example this page (I also tried it with the command scrapy shell url):
https://www.tuscc.si/produkti/instant-juhe#64;1563872492384;
But the response is always the first page, which it scrapes repeatedly:
https://www.tuscc.si/produkti/instant-juhe
I would be grateful for your help. Thanks.
THE PARSE_DETAIL GENERATOR FUNCTION
def parse_detail(self, response):
    items = TusPomosItem()
    pro = response.css(".thumb-box")
    for p in pro:
        pro_link = p.css("a::attr(href)").extract_first()
        pro_name = p.css(".description::text").extract_first()
        items['pro_link'] = pro_link
        items['pro_name'] = pro_name
        my_details = {
            'pro_link': pro_link,
            'pro_name': pro_name
        }
        with open('pro_file.json', 'w') as json_file:
            json.dump(my_details, json_file)
        yield items
        # yield scrapy.FormRequest(
        #     url='https://www.tuscc.si/produkti/instant-juhe',
        #     callback=self.parse_detail,
        #     method='POST',
        #     headers=self.headers
        # )
Here I am not sure whether I should assign my „items“ variable the way it is, or get it from response.body.
Also, should the yield stay the way it is, or should I replace it with a Request (which is more or less copied from the ANSWER code given)?
I am new here, so thanks for understanding!
Instead of using Splash to render the pages, it's probably more efficient to get the data from the underlying requests that are made.
The piece of code below goes through all pages with articles. In parse_detail, you can write the logic to load the data from the response into JSON, in which you can find the 'pro_link' and 'pro_name' of the products.
import scrapy
import json
from scrapy.spiders import Spider
from ..items import TusPomosItem


class TusPomosSpider(Spider):
    name = 'TUSP'
    allowed_domains = ['tuscc.si']
    start_urls = ["https://www.tuscc.si/produkti/instant-juhe"]
    download_delay = 5.0

    headers = {
        'Origin': 'https://www.tuscc.si',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en;q=0.9,nl-BE;q=0.8,nl;q=0.7,ro-RO;q=0.6,ro;q=0.5,en-US;q=0.4',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'Content-Type': 'application/json; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
        'Referer': 'https://www.tuscc.si/produkti/instant-juhe',
    }

    def parse(self, response):
        number_of_pages = int(response.xpath(
            '//*[@class="paginationHolder"]//@data-size').extract_first())
        number_per_page = int(response.xpath(
            '//*[@name="pageSize"]/*[@selected="selected"]/text()').extract_first())

        for page_number in range(0, number_of_pages):
            skip = number_per_page * page_number
            data = {"action": "loadList",
                    "filter": {"1005": [], "1006": [], "1007": [], "1009": [],
                               "1013": []},
                    "skip": str(skip),
                    "sort": "none"
                    }
            yield scrapy.Request(
                url='https://www.tuscc.si/produkti/instant-juhe',
                callback=self.parse_detail,
                method='POST',
                body=json.dumps(data),
                headers=self.headers
            )

    def parse_detail(self, response):
        detail_page = json.loads(response.text)
        for product in detail_page['docs']:
            item = TusPomosItem()
            item['pro_link'] = product['url']
            item['pro_name'] = product['title']
            yield item
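To save the scraped items to a file, the spider can then be run with Scrapy's feed export, for example: scrapy crawl TUSP -o products.json.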

Scrapy - Simulating AJAX requests with headers and request payload

https://www.kralilan.com/liste/kiralik-bina
This is the website I am trying to scrape. When you open the website, the listings are generated with an AJAX request. The same request keeps populating the page whenever you scroll down; this is how they implemented infinite scrolling...
I found the request sent to the server when I scroll down, and I tried to simulate it with the same headers and request payload. This is my spider:
import scrapy


class MySpider(scrapy.Spider):
    name = 'kralilanspider'
    allowed_domains = ['kralilan.com']
    start_urls = [
        'https://www.kralilan.com/liste/satilik-bina'
    ]

    def parse(self, response):
        headers = {'Referer': 'https://www.kralilan.com/liste/kiralik-bina',
                   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0',
                   'Accept': 'application/json, text/javascript, */*; q=0.01',
                   'Accept-Language': 'en-US,en;q=0.5',
                   'Accept-Encoding': 'gzip, deflate, br',
                   #'Content-Type': 'application/json; charset=utf-8',
                   #'X-Requested-With': 'XMLHttpRequest',
                   #'Content-Length': 246,
                   #'Connection': 'keep-alive',
                   }

        yield scrapy.Request(
            url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
            method='POST',
            headers=headers,
            callback=self.parse_ajax
        )

    def parse_ajax(self, response):
        yield {'data': response.text}
If I uncomment the commented headers, the request fails with status code 400 or 500.
I tried to send the request payload as a body in the parse method. That didn't work either.
If I try to yield response.body, I get TypeError: Object of type bytes is not JSON serializable.
What am I missing here?
The following implementation will fetch the response you are after. You missed the most important part: the data to pass as the body of your POST request.
import json
import scrapy


class MySpider(scrapy.Spider):
    name = 'kralilanspider'

    data = {'incomestr':'["Bina","1",-1,-1,-1,-1,-1,5]', 'intextstr':'{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}', 'index':0 , 'count':'10' , 'opt':'1' , 'type':'3'}

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
            method='POST',
            body=json.dumps(self.data),
            headers={"content-type": "application/json"}
        )

    def parse(self, response):
        items = json.loads(response.text)['d']
        yield {"data": items}
In case you want to parse data from multiple pages (a new page index is requested as you scroll down), the following will do the trick. The pagination is controlled by the index key in your data.
import json
import scrapy


class MySpider(scrapy.Spider):
    name = 'kralilanspider'

    data = {'incomestr':'["Bina","1",-1,-1,-1,-1,-1,5]', 'intextstr':'{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}', 'index':0 , 'count':'10' , 'opt':'1' , 'type':'3'}
    headers = {"content-type": "application/json"}
    url = 'https://www.kralilan.com/services/ki_operation.asmx/getFilter'

    def start_requests(self):
        yield scrapy.Request(
            url=self.url,
            method='POST',
            body=json.dumps(self.data),
            headers=self.headers,
            meta={'index': 0}
        )

    def parse(self, response):
        items = json.loads(response.text)['d']
        res = scrapy.Selector(text=items)
        for item in res.css(".list-r-b-div"):
            title = item.css(".add-title strong::text").get()
            price = item.css(".item-price::text").get()
            yield {"title": title, "price": price}

        page = response.meta['index'] + 1
        self.data['index'] = page
        yield scrapy.Request(self.url, headers=self.headers, method='POST', body=json.dumps(self.data), meta={'index': page})
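One caveat: written like this, the spider keeps posting ever larger index values indefinitely. A simple way to stop is to break off once a response no longer contains any listings, for example (a sketch that assumes the same spider attributes as above):
    def parse(self, response):
        items = json.loads(response.text)['d']
        res = scrapy.Selector(text=items)
        listings = res.css(".list-r-b-div")
        if not listings:
            # No more results for this index value, so stop paginating.
            return
        for item in listings:
            yield {"title": item.css(".add-title strong::text").get(),
                   "price": item.css(".item-price::text").get()}
        page = response.meta['index'] + 1
        self.data['index'] = page
        yield scrapy.Request(self.url, headers=self.headers, method='POST',
                             body=json.dumps(self.data), meta={'index': page})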
Why do you ignore the POST body? You need to submit it too:
def parse(self, response):
    headers = {'Referer': 'https://www.kralilan.com/liste/kiralik-bina',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0',
               'Accept': 'application/json, text/javascript, */*; q=0.01',
               'Accept-Language': 'en-US,en;q=0.5',
               'Accept-Encoding': 'gzip, deflate, br',
               'Content-Type': 'application/json; charset=utf-8',
               'X-Requested-With': 'XMLHttpRequest',
               #'Content-Length': 246,
               #'Connection': 'keep-alive',
               }

    payload = """
    { incomestr:'["Bina","2",-1,-1,-1,-1,-1,5]', intextstr:'{"isCoordinates":false,"ListDrop":[],"ListText":[{"id":"78","Min":"","Max":""},{"id":"107","Min":"","Max":""}],"FiyatData":{"Max":"","Min":""}}', index:'0' , count:'10' , opt:'1' , type:'3'}
    """

    yield scrapy.Request(
        url='https://www.kralilan.com/services/ki_operation.asmx/getFilter',
        method='POST',
        body=payload,
        headers=headers,
        callback=self.parse_ajax
    )
