Scrapy, how to send multiple requests to a form - Python

I have working code here. I am sending 1 request to a form, and I am getting back all the data that I need. Code:
def start_requests(self):
    numbers = "12345"
    submitForm = FormRequest("https://example.com/url",
                             formdata={'address': numbers, 'submit': 'Search'},
                             callback=self.after_submit)
    return [submitForm]
Now, I need to send multiple requests through the same form and collect the data for each request. I need to collect the data for x numbers, all of which I stored in a file:
12345
54644
32145
12345
code:
def start_requests(self):
    with open(r'C:\spiders\usps\zips.csv') as fp:
        for line in fp:
            submitForm = FormRequest("https://example.com/url",
                                     formdata={'address': line, 'submit': 'Search'},
                                     callback=self.after_submit,
                                     dont_filter=True)
    return [submitForm]
This code runs, but it only collects data for the last entry in the file. I need to collect the data for each row/number in the file. If I try yield instead, Scrapy stops and throws this error:
if not request.dont_filter and self.df.request_seen(request):
exceptions.AttributeError: 'list' object has no attribute 'dont_filter'

First of all, you definitely need yield to "fire" up multiple requests:
def start_requests(self):
    with open(r'C:\spiders\usps\zips.csv') as fp:
        for line in fp:
            yield FormRequest("https://domain.com/url",
                              formdata={'address': line, 'submit': 'Search'},
                              callback=self.after_submit,
                              dont_filter=True)
Also, you shouldn't enclose the FormRequest into a list, just yield the request.
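As an aside (not part of the original answer): each line read from the file keeps its trailing newline, so you may want to strip it before putting it into formdata. A minimal sketch of that variation:

def start_requests(self):
    with open(r'C:\spiders\usps\zips.csv') as fp:
        for line in fp:
            number = line.strip()  # drop the trailing newline/whitespace
            if not number:
                continue  # skip blank lines
            yield FormRequest("https://domain.com/url",
                              formdata={'address': number, 'submit': 'Search'},
                              callback=self.after_submit,
                              dont_filter=True)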

Related

Scrapy passing requests along

I previously used some code like this to visit a page and change the url around a bit to generate a second request which gets passed to a second parse method:
from scrapy.http import Request

def parse_final_page(self, response):
    # do scraping here:
    pass

def get_next_page(self, response, new_url):
    req = Request(
        url=new_url,
        callback=self.parse_final_page,
    )
    yield req

def parse(self, response):
    if 'substring' in response.url:
        new_url = 'some_new_url'
        yield from self.get_next_page(response, new_url)
    else:
        pass
        # continue..
        # scraping items
        # yield
This snippet is pretty old (2 years or so) and I'm currently using Scrapy 2.2, although I'm not sure if that's relevant. Note that get_next_page gets called, but parse_final_page never runs, which I don't get...
Why is parse_final_page not being called? Or, more to the point, is there an easier way for me to just generate a new request on the fly? I would prefer not to use a middleware or change start_urls in this context.
1 - "Why is parse_final_page not being called?"
Your script works fine for me on Scrapy v2.2.1, so it's probably an issue with the specific request you're trying to make.
2 - "...is there an easier way for me to just generate a new request on the fly?"
You could try this variation where you return the request from the get_next_page callback, instead of yielding it (note I removed the from keyword and did not send the response object to the callback):
def parse(self, response):
    if 'substring' in response.url:
        new_url = ''
        yield self.get_next_page(new_url)
    else:
        pass
        # continue..
        # scraping items
        # yield

def get_next_page(self, new_url):
    req = Request(
        url=new_url,
        callback=self.parse_final_page,
    )
    return req

def parse_final_page(self, response):
    # do scraping here:
    pass

Scrapy/Python yield and continue processing possible?

I am trying this sample code
from scrapy.spiders import Spider, Request
import scrapy

class MySpider(Spider):
    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        for url in self.urls:
            return Request(url)
It crawls all the pages fine. However, if I yield an item before the for loop, it crawls only the first page (as shown below):
from scrapy.spiders import Spider, Request
import scrapy

class MySpider(Spider):
    name = 'toscrapecom'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        yield scrapy.item.Item()
        for url in self.urls:
            return Request(url)
But I can use yield Request(url) instead of return... and it scrapes the pages backwards from last page to first.
I would like to understand why return does not work anymore once an item is yielded? Can somebody explain this in a simple way?
You ask why the second code does not work, but I don’t think you fully understand why the first code works :)
The for loop of your first code only loops once.
What is happening is:
self.parse() is called for the URL in self.start_urls.
self.parse() gets the first (and only the first!) URL from self.urls and returns a request for it, exiting self.parse().
When a response for that first URL arrives, self.parse() gets called again, and this time it returns a request (only 1 request!) for the second URL from self.urls, because the previous call to self.parse() already consumed the first URL from it (self.urls is an iterator).
The last step repeats in a loop, but it is not the for loop that does it.
You can change your original code to this and it will work the same way:
def parse(self, response):
    try:
        return Request(next(self.urls))
    except StopIteration:
        pass
To yield items/requests, the callback should be a generator function.
You cannot even use yield and return with a value in the same function with the same "meaning"; in Python 2 it raises SyntaxError: 'return' with argument inside generator.
A bare return is (almost) equivalent to raising StopIteration. In the topic Return and yield in the same function you can find a very detailed explanation, with links to the specification.
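For completeness, a minimal sketch of the generator-style callback the question alludes to with yield Request(url): because parse() becomes a generator, every URL gets requested, not just the first one hit by a return.

def parse(self, response):
    # yielding (instead of returning) lets the loop run to completion,
    # scheduling a request for every URL in the iterator
    for url in self.urls:
        yield Request(url)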

Scrapy get pre-redirect url

I have a crawler running without trouble, but I need to get the start_url and not the redirected one.
The problem is I'm using rules to pass parameters to the URL (like field-keywords=xxxxx) and finally get the correct url.
The parse function starts getting the item attributes without any trouble, but when I want the start URL (the true one) it stores the redirected one...
I've tried:
response.url
response.request.meta.get('redirect_urls')
Both return the final URL (the redirected one) and not the start_url.
Does anyone know why, or have any clue?
Thanks in advance.
Use a spider middleware to keep track of the start URL for every response:
from scrapy import Request

class StartRequestsMiddleware(object):
    start_urls = {}

    def process_start_requests(self, start_requests, spider):
        for i, request in enumerate(start_requests):
            request.meta.update(start_url=request.url)
            yield request

    def process_spider_output(self, response, result, spider):
        for output in result:
            if isinstance(output, Request):
                output.meta.update(
                    start_url=response.meta['start_url'],
                )
            yield output
Then access the start_url each response came from with:
response.meta['start_url']
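A minimal sketch of wiring this up; the module path and the priority number in SPIDER_MIDDLEWARES are illustrative assumptions, not part of the original answer:

# settings.py -- module path and priority are examples, adjust to your project
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.StartRequestsMiddleware': 543,
}

# in your spider callback
def parse(self, response):
    original_url = response.meta['start_url']  # pre-redirect start URL
    self.logger.info('started from %s, landed on %s', original_url, response.url)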
Have you tried response.request.url? I personally would override the start_requests method adding the original url in the meta, something like:
yield Request(url, meta={'original_request': url})
And then extract it using response.meta['original_request']
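A minimal sketch of that approach, assuming the spider has a start_urls list (names other than the meta key are illustrative):

from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        # carry the original URL along so it survives any redirects
        yield Request(url, meta={'original_request': url}, callback=self.parse)

def parse(self, response):
    original_url = response.meta['original_request']  # the pre-redirect URL
    self.logger.info('original: %s, final: %s', original_url, response.url)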

Scrapy - parse a url without crawling

I have a list of URLs I want to scrape and send through all the same pipelines. How do I begin this? I'm not actually sure where to even start.
The main idea is that my crawl works through a site and its pages. It then yields to parse the page and update a database. What I am now trying to achieve is to parse the pages of all the existing URLs in the database which were not crawled that day.
I have tried doing this in a pipeline using the close_spider method, but can't get these URLs to Request/parse. As soon as I yield, the whole close_spider method is no longer called.
def close_spider(self, spider):
    for item in models.Items.select().where(models.Items.last_scraped_ts < '2016-02-06 10:00:00'):
        print item.url
        yield Request(item.url, callback=spider.parse_product, dont_filter=True)
(Re-reading your thread, I am not sure I am answering your question at all...)
I have done something similar without Scrapy, using the lxml and requests modules.
The URLs:
listeOfurl = ['url1', 'url2']
or, if the URLs follow a pattern, generate them:
for i in range(0, 10):
    url = urlpattern + str(i)
Then I made a loop that parses each URL with the same pattern:
import json
from lxml import html
import requests

listeOfurl = ['url1', 'url2']
data = {}

for url in listeOfurl:
    page = requests.get(url)
    tree = html.fromstring(page.content)
    DataYouWantToKeep = tree.xpath('//*[@id="content"]/div/h2/text()[2]')
    data[url] = DataYouWantToKeep

# and at the end you save all the data in JSON
# (filejson is the path to your output file)
with open(filejson, 'w') as outfile:
    json.dump(data, outfile)
You could simply copy and paste the URLs into start_urls; if you haven't overridden start_requests, parse will be the default callback. If it is a long list and you don't want ugly code, you can override start_requests, open your file or do a db call, and for each item in it yield a request for that URL with a callback to parse, as sketched below. This lets you use your parse function and your pipelines, as well as handle concurrency through Scrapy. If you just have a list without that extra infrastructure already existing and the list isn't too long, Sulot's answer is easier.
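A minimal sketch of that override, assuming the URLs live in a plain text file named urls.txt (the filename, spider name, and fields are illustrative):

import scrapy

class UrlListSpider(scrapy.Spider):
    name = 'url_list'  # illustrative name

    def start_requests(self):
        with open('urls.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    # each response goes through parse and then your existing pipelines
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # your existing parsing logic; yielded items flow into the pipelines
        pass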

Scrapy Start_request parse

I am writing a Scrapy script to search and scrape results from a website. I need to search items on the website and parse each URL from the search results. I started with Scrapy's start_requests, where I pass the search query and redirect to another function, parse, which retrieves the URLs from the search results. Finally I called another function, parse_item, to parse the results. I'm able to extract all the search result URLs, but I'm not able to parse the results (parse_item is not working). Here is the code:
# -*- coding: utf-8 -*-
from scrapy.http.request import Request
from scrapy.spider import BaseSpider

class xyzspider(BaseSpider):
    name = 'dspider'
    allowed_domains = ["www.example.com"]
    mylist = ['Search item 1', 'Search item 2']
    url = 'https://example.com/search?q='

    def start_requests(self):
        for i in self.mylist:
            i = i.replace(' ', '+')
            starturl = self.url + i
            yield Request(starturl, self.parse)

    def parse(self, response):
        itemurl = response.xpath(".//section[contains(@class, 'search-results')]/a/@href").extract()
        for j in itemurl:
            print j
            yield Request(j, self.parse_item)

    def parse_item(self, response):
        print "hello"
        '''rating = response.xpath(".//ul[@class = 'ratings']/li[1]/span[1]/text()").extract()
        print rating'''
Could anyone please help me. Thank you.
I was getting a "Filtered offsite request" error. I changed the allowed domain from allowed_domains = www.xyz.com to xyz.com and it worked perfectly.
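In other words (the domain here just mirrors the placeholder in the answer):

# before: only URLs under www.xyz.com pass the offsite filter
allowed_domains = ['www.xyz.com']
# after: the registered domain matches www and any other subdomain
allowed_domains = ['xyz.com']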
Your code looks good. So you might need to use the Request attribute dont_filter set to True:
yield Request(j,self.parse_item, dont_filter=True)
From the docs:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
Anyway, I recommend you have a look at Item Pipelines.
Those are used to process the scraped items that the spider emits with:
yield my_object
Item pipelines are used to post-process everything yielded by the spider.
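A minimal sketch of such a pipeline; the class name, module path, and priority number are illustrative assumptions:

# pipelines.py
class MyItemPipeline(object):
    def process_item(self, item, spider):
        # post-process every item the spider yields, e.g. clean or validate fields
        return item

# settings.py -- enable the pipeline (priority number is an example)
ITEM_PIPELINES = {
    'myproject.pipelines.MyItemPipeline': 300,
}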
