I've been trying to implement a parse function.
Essentially I figured out through the scrapy shell that
response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
gives me the URL directing me to the next page. So I tried following the instructions with next_page. I took a look around Stack Overflow and it seems that everyone uses Rule(LinkExtractor..., which I don't believe I need to use. I'm pretty sure I'm doing it completely wrong, though. I originally had a for loop that added every link I wanted to visit to start_urls, because I knew they were all of the form *p1.html, *p2.html, etc., but I want to make this smarter.
def parse(self, response):
    items = []
    for sel in response.xpath('//div[@class="Message"]'):
        itemx = mydata()
        itemx['information'] = sel.extract()
        items.append(itemx)
        with open('log.txt', 'a') as f:
            f.write('\ninformation: ' + itemx.get('information')

    # URL of next page: response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()[0]
    next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
    if (response.url != response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')):
        if next_page:
            yield Request(response.xpath('//*[@id="PagerAfter"]/a[last()]/@href')[0], self.parse)
    return items
but it does not work; I get a
next_page = (response.xpath('//*[@id="PagerAfter"]/a[last()]/@href'))
            ^
SyntaxError: invalid syntax
error. Additionally, I know that the yield Request part is wrong. I want to call parse recursively and add each page's scrape to the list items.
Thank you!
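For reference, the usual Scrapy pagination pattern would look roughly like the sketch below. This is a minimal sketch, assuming the mydata item class and the XPaths above; items are yielded one by one instead of collected in a list, and Scrapy gathers everything yielded across all pages into the same output feed.

def parse(self, response):
    # Yield one item per message found on this page
    for sel in response.xpath('//div[@class="Message"]'):
        itemx = mydata()
        itemx['information'] = sel.extract()
        yield itemx

    # Follow the "next page" link, if there is one, with this same callback
    next_page = response.xpath('//*[@id="PagerAfter"]/a[last()]/@href').extract()
    if next_page and response.urljoin(next_page[0]) != response.url:
        yield Request(response.urljoin(next_page[0]), callback=self.parse)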
I'm very new to Scrapy, Python and coding in general. I have a project where I'd like to collect blog posts to do some content analysis on them in Atlas.ti 8. Atlas supports files like .html, .txt, .docx and PDF.
I’ve built my crawler based on the scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
My main issue is that I'm unable to save the posts in their own files. I can download them as one batch with scrapy crawl <crawler> -o filename.csv, but from the csv I have to use VBA to put the posts into their own files row by row. This is a step I'd like to avoid.
My current code can be seen below.
import scrapy

class BlogCrawler(scrapy.Spider):
    name = "crawler"
    start_urls = ['url']

    def parse(self, response):
        postnro = 0
        for post in response.css('div.post'):
            postnro += 1
            yield {
                'Post nro: ': postnro,
                'date': post.css('.meta-date::text').get().replace('\r\n\t\ton', '').replace('\t',''),
                'author': post.css('.meta-author i::text').get(),
                'headline': post.css('.post-title ::text').get(),
                'link': post.css('h1.post-title.single a').attrib['href'],
                'text': [item.strip() for item in response.css('div.entry ::text').getall()],
            }

            filename = f'post-{postnro}.html'
            with open(filename, 'wb') as f:
                f.write(???)

        next_page = response.css('div.alignright a').attrib['href']
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
I've no idea how I should go about saving the results. I've tried passing response.body, response.text and TextResponse.text to f.write(), to no avail. I've also tried to collect the data in a for loop and save it like f.write(date + '\n', author + '\n', ...). Approaches like these produce empty, 0 KB files.
The reason I've set the file type to .html is that Atlas can take it as it is and the whitespace won't be an issue. In principle the filetype could also be .txt. However, if I manage to save the posts as html, I avoid the secondary issue in my project: getall() returns a list, which is why strip(), replace() and the w3lib methods are hard to apply to clean the data. The current code replaces the whitespace with commas, which is readable, but it could be better.
If anyone has ideas on how to save each blog post in a separate file, one post per file, I'd be happy to hear them.
Best regards,
Leeward
Managed to crack this after a good night's sleep and some hours of keyboard (and head) bashing. It is not pretty or elegant and does not make use of Scrapy's advanced features, but it suffices for now. It does not solve my secondary issue, but I can live with that, this being my first crawling project. There were multiple issues with my code:
"postnro" was not being updated, so the code kept writing the same file over and over again. I was unable to make it work, so I used "date" instead. I could have used each post's unique id as well, but those were so random that I would not have known which file I was working with without opening it.
I could not figure out how to save the yield to a file, so I looped over what I wanted and saved the results one by one.
I switched the filetype from .html to .txt, but it took me some time to figure out that I also had to switch 'wb' to plain 'w'.
For those interested, working code (so to speak) below:
def parse(self, response):
    for post in response.css('div.post'):
        date = post.css('.meta-date::text').get().replace('\r\n\t\ton ', '').replace('\t','')
        author = post.css('.meta-author i::text').get()
        headline = post.css('.post-title ::text').get()
        link = post.css('h1.post-title.single a').attrib['href']
        text = [item.strip() for item in post.css('div.entry ::text').getall()]

        filename = f'post-{date}.txt'
        with open(filename, 'w') as f:
            f.write(str(date) + '\n' + str(author) + '\n' + str(headline) + '\n' + str(link) + '\n' + '\n' + str(text) + '\n')

    next_page = response.css('div.alignleft a::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
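A possible refinement for the secondary issue (just a sketch, not part of the original code): since text is already a list of stripped strings, joining it with newlines before writing keeps the post readable instead of dumping the Python list representation.

# Inside the "with open(filename, 'w') as f:" block, instead of writing str(text):
body = '\n'.join(item for item in text if item)
f.write(str(date) + '\n' + str(author) + '\n' + str(headline) + '\n' + str(link) + '\n\n' + body + '\n')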
I'd like to crawl a certain URL which returns a random response each time it's called. The code below returns what I want, but I'd like to run it for a long time so that I can use the data for an NLP application. This code only runs once with scrapy crawl the, though I expect it to run more because of the last if statement.
Is Unix's start command what I'm looking for? I tried it, but it felt a bit slow. If I had to use the start command, would opening many tabs in the terminal and running the same command with the start prefix be good practice, or would it just throttle the speed?
class TheSpider(scrapy.Spider):
    name = 'the'
    allowed_domains = ['https://websiteiwannacrawl.com']
    start_urls = ['https://websiteiwannacrawl.com']

    def parse(self, response):
        info = {}
        info['text'] = response.css('.pd-text').extract()
        yield info

        next_page = 'https://websiteiwannacrawl.com'
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
dont_filter
indicates that this request should not be filtered by the scheduler.
This is used when you want to perform an identical request multiple
times, to ignore the duplicates filter. Use it with care, or you will
get into crawling loops. Default to False
You should add this in your Request
yield scrapy.Request(next_page, dont_filter=True)
It's not about your question, but for callback=self.parse, please read about the parse method.
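Put together, the tail of the parse method above would look roughly like this (a minimal sketch; dont_filter=True is the essential change, and the explicit callback is optional, since parse is the default callback anyway):

next_page = 'https://websiteiwannacrawl.com'
if next_page is not None:
    # dont_filter=True tells the scheduler not to drop this as a duplicate request
    yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)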
I have this code available from my previous experiment.
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for title in response.css('h2'):
            yield {'Agent-name': title.css('a ::text').extract_first()}

        next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I don't understand how to modify this code to take a list of URLs from a text file as input (maybe 200+ domains), check the HTTP status of each domain, and store the result in a file. I am trying to check whether the domains are live or not.
What I am expecting to have output is:
example.com,200
example1.com,300
example2.com,503
I want to give a file as input to the scrapy script, and it should give me the above output. I have tried looking at the questions How to detect HTTP response status code and set a proxy accordingly in scrapy? and Scrapy and response status code: how to check against it?,
but had no luck. Hence, I am thinking of modifying my code to get it done. How can I do that? Please help me.
For each response object you can get the URL and status code through the response object's properties. So for each link you send a request to, you can get the status code using response.status.
Does that work the way you want?
def parse(self, response):
    # file chosen to get the output (opened in append mode):
    file.write(u"%s : %s\n" % (response.url, response.status))
    # if response.status in [400, ...]: do smthg
    for title in response.css('h2'):
        yield {'Agent-name': title.css('a ::text').extract_first()}

    next_page = response.css('li.col-md-3 ln-t > div.cs-team team-grid > figure > a ::attr(href)').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
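To feed the spider the list of domains from a text file and record each status, one approach is sketched below. This is an assumption-laden sketch, not the answerer's code: the file names domains.txt and status.csv are placeholders, and HTTPERROR_ALLOW_ALL is enabled so that non-200 responses still reach the callback.

import scrapy

class StatusSpider(scrapy.Spider):
    name = 'status'
    # Let non-2xx responses reach the callback instead of being filtered out
    custom_settings = {'HTTPERROR_ALLOW_ALL': True}

    def start_requests(self):
        # domains.txt (placeholder name): one domain or URL per line
        with open('domains.txt') as f:
            for line in f:
                url = line.strip()
                if not url:
                    continue
                if not url.startswith('http'):
                    url = 'http://' + url
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Append "url,status" to a results file (placeholder name)
        with open('status.csv', 'a') as out:
            out.write('%s,%s\n' % (response.url, response.status))

Requests that fail before any response exists (DNS errors, timeouts) never reach parse; an errback on the Request would be needed to record those domains as down.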
Disclaimer: I'm fairly new to Scrapy.
To put my question plainly: How can I retrieve an Item property from a link on a page and get the results back into the same Item?
Given the following sample Spider:
class SiteSpider(Spider):
    site_loader = SiteLoader
    ...
    def parse(self, response):
        item = Place()
        sel = Selector(response)
        bl = self.site_loader(item=item, selector=sel)
        bl.add_value('domain', self.parent_domain)
        bl.add_value('origin', response.url)
        for place_property in item.fields:
            parse_xpath = self.template.get(place_property)
            # parse_xpath will look like either:
            # '//path/to/property/text()'
            # or
            # {'url': '//a[@id="Location"]/@href',
            #  'xpath': '//div[@class="directions"]/span[@class="address"]/text()'}
            if isinstance(parse_xpath, dict):  # place_property is at a URL
                url = sel.xpath(parse_xpath['url_elem']).extract()
                yield Request(url, callback=self.get_url_property,
                              meta={'loader': bl, 'parse_xpath': parse_xpath,
                                    'place_property': place_property})
            else:  # parse_xpath is just an xpath; process normally
                bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

    def get_url_property(self, response):
        loader = response.meta['loader']
        parse_xpath = response.meta['parse_xpath']
        place_property = response.meta['place_property']
        sel = Selector(response)
        loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
        return loader
I'm running these spiders against multiple sites, and most of them have the data I need on one page, where this works just fine. However, some sites have certain properties on a sub-page (e.g., the "address" data living behind the "Get Directions" link).
The "yield Request" line is really where I have the problem. I see the items move through the pipeline, but they're missing the properties that are found at other URLs (in other words, the properties that go through "yield Request"). The get_url_property callback basically just looks for an xpath within the new response and adds it to the item loader instance.
Is there a way to do what I'm looking for, or is there a better way? I would like to avoid making a synchronous call to get the data I need (if that's even possible here), but if that's the best way, then maybe that's the right approach. Thanks.
If I understand you correctly, you have (at least) two different cases:
The crawled page links to another page containing the data (1+ further request necessary)
The crawled page contains the data (No further request necessary)
In your current code, you call yield bl.load_item() for both cases, but in the parse callback. Note that the request you yield is executed at some later point in time; thus the item is incomplete at that moment, and that's why you're missing the place_property key from the item in the first case.
Possible Solution
A possible solution (if I understood you correctly) is to exploit the asynchronous behavior of Scrapy. Only minor changes to your code are involved.
For the first case, you pass the item loader to another request, which will then yield it. This is what you do in the isinstance if clause. You'll need to change the return value of the get_url_property callback to actually yield the loaded item.
For the second case, you can return the item directly, so simply yield the item in the else clause.
The following code contains the changes to your example.
Does this solve your problem?
def parse(self, response):
    # ...
    if isinstance(parse_xpath, dict):  # place_property is at a URL
        url = sel.xpath(parse_xpath['url_elem']).extract()
        yield Request(url, callback=self.get_url_property,
                      meta={'loader': bl, 'parse_xpath': parse_xpath,
                            'place_property': place_property})
    else:  # parse_xpath is just an xpath; process normally
        bl.add_xpath(place_property, parse_xpath)
        yield bl.load_item()

def get_url_property(self, response):
    loader = response.meta['loader']
    # ...
    loader.add_value(place_property, sel.xpath(parse_xpath['xpath']))
    yield loader.load_item()
Related to that problem is the question of chaining requests, for which I have noted a similar solution.
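The chaining idea generalizes: every intermediate callback passes the loader along in meta and yields the next request, and only the last callback yields the loaded item. A rough sketch with hypothetical callback names (parse_first, parse_second) and placeholder URLs standing in for links extracted from each page:

def parse(self, response):
    # ... build the item loader bl as in the question ...
    yield Request(first_url, callback=self.parse_first, meta={'loader': bl})

def parse_first(self, response):
    loader = response.meta['loader']
    sel = Selector(response)
    loader.add_value('first_property', sel.xpath('//span[@class="a"]/text()').extract())
    # Not done yet: hand the loader on to the next sub-page
    yield Request(second_url, callback=self.parse_second, meta={'loader': loader})

def parse_second(self, response):
    loader = response.meta['loader']
    sel = Selector(response)
    loader.add_value('second_property', sel.xpath('//span[@class="b"]/text()').extract())
    # Last step in the chain: only now load and yield the item
    yield loader.load_item()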
I keep getting an invalid syntax error for
1.add_xpath('tagLine', '//p[@class="tagline"]/text()')
and I cannot seem to figure out why it is giving me that error, since as far as I can tell it is the same syntax as all of the other add_xpath() calls. My other question is how do I request other pages? Basically, I am going through one big page and having it go through each link on that page; once it is done with the page, I want it to go to the next button for the next large page, but I don't know how to do that.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for url in hxs.select('//a[@class="title"]/@href').extract():
        yield Request(url, callback=self.description_page)
    for url_2 in hxs.select('//a[@class="POINTER"]/@href').extract():
        yield Request(url_2, callback=self.description_page)

def description_page(self, response):
    l = XPathItemLoader(item=TvspiderItem(), response=response)
    l.add_xpath('title', '//div[@class="m show_head"]/h1/text()')
    1.add_xpath('tagLine', '//p[@class="tagline"]/text()')
    1.add_xpath('description', '//div[@class="description"]/span')
    1.add_xpath('rating', '//div[@class="score"]/text()')
    1.add_xpath('imageSrc', '//div[@class="image_bg"]/img/@src')
    return l.load_item()
Any help on this would be greatly appreciated. I am still a bit of a noob when it comes to Python and Scrapy.
def description_page(self, response):
    l = XPathItemLoader(item=TvspiderItem(), response=response)
    l.add_xpath('title', '//div[@class="m show_head"]/h1/text()')
    1.add_xpath('tagLine', '//p[@class="tagline"]/text()')
    1.add_xpath('description', '//div[@class="description"]/span')
    1.add_xpath('rating', '//div[@class="score"]/text()')
    1.add_xpath('imageSrc', '//div[@class="image_bg"]/img/@src')
    return l.load_item()
You have the digit 1 instead of the variable name l.
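For the second part of the question (moving on to the next large page), the usual approach is to extract the next button's href in parse and yield another Request back to parse itself. A minimal sketch, assuming (hypothetically) that the next button is an a element with class "next"; the selector would need to be adjusted to the real markup:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for url in hxs.select('//a[@class="title"]/@href').extract():
        yield Request(url, callback=self.description_page)

    # Hypothetical selector for the "next" button; adjust it to the actual page
    next_page = hxs.select('//a[@class="next"]/@href').extract()
    if next_page:
        yield Request(next_page[0], callback=self.parse)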