How to make indexing work with ItemLoader's add_xpath method - python

I'm trying to rewrite this piece of code to use ItemLoader class:
import scrapy
from ..items import Book
class BasicSpider(scrapy.Spider):
...
def parse(self, response):
item = Book()
# notice I only grab the first book among many there are on the page
item['title'] = response.xpath('//*[#class="link linkWithHash detailsLink"]/#title')[0].extract()
return item
The above works perfectly well. And now the same with ItemLoader:
from scrapy.loader import ItemLoader
class BasicSpider(scrapy.Spider):
...
def parse(self, response):
l = ItemLoader(item=Book(), response=response)
l.add_xpath('title', '//*[#class="link linkWithHash detailsLink"]/#title'[0]) # this does not work - returns an empty dict
# l.add_xpath('title', '//*[#class="link linkWithHash detailsLink"]/#title') # this of course work but returns every book title there is on page, not just the first one which is required
return l.load_item()
So I only want to grab the first book title, how do I achieve that?

A problem with your code is that Xpath uses one-based indexing. Another problem is that the index bracket should be inside the string you pass to the add_xpath method.
So the correct code would look like this:
l.add_xpath('title', '(//*[#class="link linkWithHash detailsLink"]/#title)[1]')

Related

Scrapy one item with multiple parsing functions

I am using Scrapy with python to scrape a website and I face some difficulties with filling the item that I have created.
The products are properly scraped and everything is working well as long as the info is located within the response.xpath mentioned in the for loop.
'trend' and 'number' are properly added to the Item using ItemLoader.
However, the date of the product is not located within the response.xpath cited below but in the response.css as a title : response.css('title')
import scrapy
import datetime
from trends.items import Trend_item
from scrapy.loader import ItemLoader
#Initiate the spider
class trendspiders(scrapy.Spider):
name = 'milk'
start_urls = ['https://thewebsiteforthebestmilk/ireland/2022-03-16/7/']
def parse(self, response):
for milk_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
l = ItemLoader(item=Milk_item(), selector=milk_unique, response=response)
l.add_css('milk', 'a::text')
l.add_css('number', 'span.small.text-muted::text')
return l.load_item()
How can I add the 'date' to my item please (found in response.css('title')?
I have tried to add l.add_css('date', "response.css('title')")in the for loop but it returns an error.
Should I create a new parsing function? If yes then how to send the info to the same Item?
I hope I’ve made myself clear.
Thank you very much for your help,
Since the date is outside of the selector you are using for each row, what you should do is extract that first before your for loop, since it doesn't need to be updated on each iteration.
Then with your item loader you can just use l.add_value to load it with the rest of the fields.
For example:
class trendspiders(scrapy.Spider):
name = 'trends'
start_urls = ['https://getdaytrends.com/ireland/2022-03-16/7/']
def parse(self, response):
date_str = response.xpath("//title/text()").get()
for trend_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
l = ItemLoader(item=Trend_item(), selector=trend_unique, response=response)
l.add_css('trend', 'a::text')
l.add_css('number', 'span.small.text-muted::text')
l.add_value('date', date_str)
yield l.load_item()
If response.css('title').get() gives you the answer you need, why not use the same CSS with add_css:
l.add_css('date', 'title')
Also, .add_css('date', "response.css('title')") is invalid because the second argument a valid CSS selector.

Order a json by field using scrapy

I have created a spider to scrape problems from projecteuler.net. Here I have concluded my answer to a related question with
I launch this with the command scrapy crawl euler -o euler.json and it outputs an array of unordered json objects, everyone corrisponding to a single problem: this is fine for me because I'm going to process it with javascript, even if I think resolving the ordering problem via scrapy can be very simple.
But unfortunately, ordering items to write in json by scrapy (I need ascending order by id field) seem not to be so simple. I've studied every single component (middlewares, pipelines, exporters, signals, etc...) but no one seems useful for this purpose. I'm arrived at the conclusion that a solution to solve this problem doesn't exist at all in scrapy (except, maybe, a very elaborated trick), and you are forced to order things in a second phase. Do you agree, or do you have some idea? I copy here the code of my scraper.
Spider:
# -*- coding: utf-8 -*-
import scrapy
from eulerscraper.items import Problem
from scrapy.loader import ItemLoader
class EulerSpider(scrapy.Spider):
name = 'euler'
allowed_domains = ['projecteuler.net']
start_urls = ["https://projecteuler.net/archives"]
def parse(self, response):
numpag = response.css("div.pagination a[href]::text").extract()
maxpag = int(numpag[len(numpag) - 1])
for href in response.css("table#problems_table a::attr(href)").extract():
next_page = "https://projecteuler.net/" + href
yield response.follow(next_page, self.parse_problems)
for i in range(2, maxpag + 1):
next_page = "https://projecteuler.net/archives;page=" + str(i)
yield response.follow(next_page, self.parse_next)
return [scrapy.Request("https://projecteuler.net/archives", self.parse)]
def parse_next(self, response):
for href in response.css("table#problems_table a::attr(href)").extract():
next_page = "https://projecteuler.net/" + href
yield response.follow(next_page, self.parse_problems)
def parse_problems(self, response):
l = ItemLoader(item=Problem(), response=response)
l.add_css("title", "h2")
l.add_css("id", "#problem_info")
l.add_css("content", ".problem_content")
yield l.load_item()
Item:
import re
import scrapy
from scrapy.loader.processors import MapCompose, Compose
from w3lib.html import remove_tags
def extract_first_number(text):
i = re.search('\d+', text)
return int(text[i.start():i.end()])
def array_to_value(element):
return element[0]
class Problem(scrapy.Item):
id = scrapy.Field(
input_processor=MapCompose(remove_tags, extract_first_number),
output_processor=Compose(array_to_value)
)
title = scrapy.Field(input_processor=MapCompose(remove_tags))
content = scrapy.Field()
If I needed my output file to be sorted (I will assume you have a valid reason to want this), I'd probably write a custom exporter.
This is how Scrapy's built-in JsonItemExporter is implemented.
With a few simple changes, you can modify it to add the items to a list in export_item(), and then sort the items and write out the file in finish_exporting().
Since you're only scraping a few hundred items, the downsides of storing a list of them and not writing to a file until the crawl is done shouldn't be a problem to you.
By now I've found a working solution using pipeline:
import json
class JsonWriterPipeline(object):
def open_spider(self, spider):
self.list_items = []
self.file = open('euler.json', 'w')
def close_spider(self, spider):
ordered_list = [None for i in range(len(self.list_items))]
self.file.write("[\n")
for i in self.list_items:
ordered_list[int(i['id']-1)] = json.dumps(dict(i))
for i in ordered_list:
self.file.write(str(i)+",\n")
self.file.write("]\n")
self.file.close()
def process_item(self, item, spider):
self.list_items.append(item)
return item
Though it may be non optimal, because the guide suggests in another example:
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

wrong Xpath in IMDB spider scrapy

Here:
IMDB scrapy get all movie data
response.xpath("//*[#class='results']/tr/td[3]")
returns empty list. I tried to change it to:
response.xpath("//*[contains(#class,'chart full-width')]/tbody/tr")
without success.
Any help please? Thanks.
I did not have time to go through IMDB scrapy get all movie data thoroughly, but have got the gist of it. The Problem statement is to get All movie data from the given site. It involves two things. First is to go through all the pages that contain the list of all the movies of that year. While the Second one is to get the link to each movie and then here you do your own magic.
The problem you faced is with the getting the xpath for the link to each movies. This may most likely be due to change in the website structure (I did not have time to verify what maybe the difference). Anyways, following is the xpath you would require.
FIRST :
We take div class nav as a landmark and find the lister-page-next next-page class in its children.
response.xpath("//div[#class='nav']/div/a[#class='lister-page-next next-page']/#href").extract_first()
Here this will give : Link for the next page | returns None if at the last page (since next-page tag not present)
SECOND :
This is the original doubt by the OP.
#Get the list of the container having the title, etc
list = response.xpath("//div[#class='lister-item-content']")
#From the container extract the required links
paths = list.xpath("h3[#class='lister-item-header']/a/#href").extract()
Now all you would need to do is loop through each of these paths element and request the page.
Thanks for your answer. I eventually used your xPath like so:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import MovieItem
IMDB_URL = "http://imdb.com"
class IMDBSpider(CrawlSpider):
name = 'imdb'
# in order to move the next page
rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[#class='nav']/div/a[#class='lister-page-next next-page']",)),
callback="parse_page", follow= True),)
def __init__(self, start=None, end=None, *args, **kwargs):
super(IMDBSpider, self).__init__(*args, **kwargs)
self.start_year = int(start) if start else 1874
self.end_year = int(end) if end else 2017
# generate start_urls dynamically
def start_requests(self):
for year in range(self.start_year, self.end_year+1):
# movies are sorted by number of votes
yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))
def parse_page(self, response):
content = response.xpath("//div[#class='lister-item-content']")
paths = content.xpath("h3[#class='lister-item-header']/a/#href").extract() # list of paths of movies in the current page
# all movies in this page
for path in paths:
item = MovieItem()
item['MainPageUrl'] = IMDB_URL + path
request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
request.meta['item'] = item
yield request
# make sure that the start_urls are parsed as well
parse_start_url = parse_page
def parse_movie_details(self, response):
pass # lots of parsing....
Runs it with scrapy crawl imdb -a start=<start-year> -a end=<end-year>

How to use the Rule class in scrapy

I am trying to use the Rule class to go to the next page in my crawler. Here is my code
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from crawler.items import GDReview
class GdSpider(CrawlSpider):
name = "gd"
allowed_domains = ["glassdoor.com"]
start_urls = [
"http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
]
rules = (
# Extract next links and parse them with the spider's method parse_item
Rule(SgmlLinkExtractor(restrict_xpaths=('//li[#class="next"]/a/#href',)), follow= True)
)
def parse(self, response):
company_name = response.xpath('//*[#id="EIHdrModule"]/div[3]/div[2]/p/text()').extract()
'''loop over every review in this page'''
for sel in response.xpath('//*[#id="EmployerReviews"]/ol/li'):
review = Item()
review['company_name'] = company_name
review['id'] = str(sel.xpath('#id').extract()[0]).split('_')[1] #sel.xpath('#id/text()').extract()
review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()
yield review
My question is about the rules section. In this rule, the link extracted doesn't contain the domain name. For example, it will return something like
"/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
How can I make sure that my crawler will append the domain to the returned link?
Thanks
You can be sure since this is the default behavior of link extractors in Scrapy (source code).
Also, the restrict_xpaths argument should not point to #href attribute, but instead it should either point to a elements or containers having a elements as descendants. Plus, restrict_xpaths can be defined as string.
In other words, replace:
restrict_xpaths=('//li[#class="next"]/a/#href',)
with:
restrict_xpaths='//li[#class="next"]/a'
Besides, you need to switch to LxmlLinkExtractor from SgmlLinkExtractor:
SGMLParser based link extractors are unmantained and its usage is
discouraged. It is recommended to migrate to LxmlLinkExtractor if you
are still using SgmlLinkExtractor.
Personally, I usually use the LinkExractor shortcut to LxmlLinkExtractor:
from scrapy.contrib.linkextractors import LinkExtractor
To summarize, this is what I would have in rules:
rules = [
Rule(LinkExtractor(restrict_xpaths='//li[#class="next"]/a'), follow=True)
]

Creating a generic scrapy spider

My question is really how to do the same thing as a previous question, but in Scrapy 0.14.
Using one Scrapy spider for several websites
Basically, I have GUI that takes parameters like domain, keywords, tag names, etc. and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, using older versions of scrapy, by either overriding the spider manager class or by dynamically creating a spider. Which method is preferred and how do I implement and invoke the proper solution? Thanks in advance.
Here is the code that I want to make generic. It also uses BeautifulSoup. I paired it down so hopefully didn't remove anything crucial to understand it.
class MySpider(CrawlSpider):
name = 'MySpider'
allowed_domains = ['somedomain.com', 'sub.somedomain.com']
start_urls = ['http://www.somedomain.com']
rules = (
Rule(SgmlLinkExtractor(allow=('/pages/', ), deny=('', ))),
Rule(SgmlLinkExtractor(allow=('/2012/03/')), callback='parse_item'),
)
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll('p', itemprop="myProp")
for contentTag in contentTags:
matchedResult = re.search('Keyword1|Keyword2', contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)
pass
You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so:
a = open("test.py")
from compiler import compile
d = compile(a.read(), 'spider.py', 'exec')
eval(d)
MySpider
<class '__main__.MySpider'>
print MySpider.start_urls
['http://www.somedomain.com']
I use the Scrapy Extensions approach to extend the Spider class to a class named Masterspider that includes a generic parser.
Below is the very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a Javascript engine (such as Selenium or BeautifulSoup) a as soon as you start working on pages using AJAX. And a lot of additional code to manage differences between sites (scrap based on column title, handle relative vs long URL, manage different kind of data containers, etc...).
What is interresting with the Scrapy Extension approach is that you can still override the generic parser method if something does not fit but I never had to. The Masterspider class checks if some methods have been created (eg. parser_start, next_url_parser...) under the site specific spider class to allow the management of specificies: send a form, construct the next_url request from elements in the page, etc.
As I'm scraping very different sites, there's always specificities to manage. That's why I prefer to keep a class for each scraped site so that I can write some specific methods to handle it (pre-/post-processing except PipeLines, Request generators...).
masterspider/sitespider/settings.py
EXTENSIONS = {
'masterspider.masterspider.MasterSpider': 500
}
masterspider/masterspdier/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
class MasterSpider(Spider):
def start_requests(self):
if hasattr(self,'parse_start'): # First page requiring a specific parser
fcallback = self.parse_start
else:
fcallback = self.parse
return [ Request(self.spd['start_url'],
callback=fcallback,
meta={'itemfields': {}}) ]
def parse(self, response):
sel = Selector(response)
lines = sel.xpath(self.spd['xlines'])
# ...
for line in lines:
item = genspiderItem(response.meta['itemfields'])
# ...
# Get request_url of detailed page and scrap basic item info
# ...
yield Request(request_url,
callback=self.parse_item,
meta={'item':item, 'itemfields':response.meta['itemfields']})
for next_url in sel.xpath(self.spd['xnext_url']).extract():
if hasattr(self,'next_url_parser'): # Need to process the next page URL before?
yield self.next_url_parser(next_url, response)
else:
yield Request(
request_url,
callback=self.parse,
meta=response.meta)
def parse_item(self, response):
sel = Selector(response)
item = response.meta['item']
for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
item[itemname] = "\n".join(sel.xpath(xitemname).extract())
return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider
class targetsiteSpider(MasterSpider):
name = "targetsite"
allowed_domains = ["www.targetsite.com"]
spd = {
'start_url' : "http://www.targetsite.com/startpage", # Start page
'xlines' : "//td[something...]",
'xnext_url' : "//a[contains(#href,'something?page=')]/#href", # Next pages
'x_ondetailpage' : {
"itemprop123" : u"id('someid')//text()"
}
}
# def next_url_parser(self, next_url, response): # OPTIONAL next_url regexp pre-processor
# ...
Instead of having the variables name,allowed_domains, start_urls and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from that passing the necessary arguments, and setting name, allowed_domains etc. per object.
MyProp and keywords also should be set within your __init__. So in the end you should have something like below. You don't have to add name to the arguments, as name is set by BaseSpider itself from kwargs:
class MySpider(CrawlSpider):
def __init__(self, allowed_domains=[], start_urls=[],
rules=[], findtag='', finditemprop='', keywords='', **kwargs):
CrawlSpider.__init__(self, **kwargs)
self.allowed_domains = allowed_domains
self.start_urls = start_urls
self.rules = rules
self.findtag = findtag
self.finditemprop = finditemprop
self.keywords = keywords
def parse_item(self, response):
contentTags = []
soup = BeautifulSoup(response.body)
contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
for contentTag in contentTags:
matchedResult = re.search(self.keywords, contentTag.text)
if matchedResult:
print('URL Found: ' + response.url)
I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this and I would be interested to learn what other people think.
I usually just override the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own init-function or I pull in that data from a database where I have saved a list of links to be appended to start_urls earlier.
As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
class MySpider(scrapy.Spider):
"""
This spider will try to crawl whatever is passed in `start_urls` which
should be a comma-separated string of fully qualified URIs.
Example: start_urls=http://localhost,http://example.com
"""
def __init__(self, name=None, **kwargs):
if 'start_urls' in kwargs:
self.start_urls = kwargs.pop('start_urls').split(',')
super(Spider, self).__init__(name, **kwargs)

Categories

Resources