Scrapy parse to pipeline - python

For example, I want to crawl three similar URLs:
https://example.com/book1
https://example.com/book2
https://example.com/book3
What I want is, in pipeline.py, to create 3 files named book1, book2 and book3, and to write each book's data to its own file, correctly and separately.
In spider.py I know the three books' names, which I want to use as the file names, but I don't know them in pipeline.py.
The pages share the same structure, so I decided to code like below:
class Book_Spider(scrapy.Spider):
    def start_requests(self):
        for url in urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # item handling
        yield item
Now, how can I do this?

Smith, if you want to know the book name in pipeline.py, there are two options: either you add an item field such as book_file_name and populate it in the spider as you see fit, or you extract the name from the url field, which is also an item field and is accessible in pipeline.py.
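For example, here is a minimal pipeline sketch along those lines; the book_file_name field, the url field, and the JSON-lines output format are assumptions for illustration, not part of the original code:

import json

class BookFilePipeline:
    def open_spider(self, spider):
        self.files = {}  # one open file handle per book name

    def process_item(self, item, spider):
        # Option 1: the spider sets an explicit 'book_file_name' field.
        # Option 2 (fallback): derive the name from the 'url' field,
        # e.g. https://example.com/book1 -> 'book1'.
        name = item.get('book_file_name') or item['url'].rstrip('/').rsplit('/', 1)[-1]
        if name not in self.files:
            self.files[name] = open(name + '.jl', 'w')
        self.files[name].write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

Remember to enable the pipeline via ITEM_PIPELINES in your project's settings.py.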

Related

Scrapy one item with multiple parsing functions

I am using Scrapy with python to scrape a website and I face some difficulties with filling the item that I have created.
The products are properly scraped and everything is working well as long as the info is located within the response.xpath mentioned in the for loop.
'trend' and 'number' are properly added to the Item using ItemLoader.
However, the date of the product is not located within that XPath but in the page title, which I can reach with response.css('title').
import scrapy
import datetime
from trends.items import Trend_item
from scrapy.loader import ItemLoader

# Initiate the spider
class trendspiders(scrapy.Spider):
    name = 'milk'
    start_urls = ['https://thewebsiteforthebestmilk/ireland/2022-03-16/7/']

    def parse(self, response):
        for milk_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=milk_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            return l.load_item()
How can I add the 'date' (found in response.css('title')) to my item, please?
I have tried adding l.add_css('date', "response.css('title')") in the for loop, but it returns an error.
Should I create a new parsing function? If so, how do I send the info to the same item?
I hope I’ve made myself clear.
Thank you very much for your help,
Since the date is outside of the selector you are using for each row, you should extract it before your for loop, since it doesn't need to be updated on each iteration.
Then with your item loader you can just use l.add_value to load it with the rest of the fields.
For example:
import scrapy
from scrapy.loader import ItemLoader
from trends.items import Trend_item

class trendspiders(scrapy.Spider):
    name = 'trends'
    start_urls = ['https://getdaytrends.com/ireland/2022-03-16/7/']

    def parse(self, response):
        date_str = response.xpath("//title/text()").get()
        for trend_unique in response.xpath('/html/body/main/div/div[2]/div[1]/section[1]/div/div[3]/table/tbody/tr'):
            l = ItemLoader(item=Trend_item(), selector=trend_unique, response=response)
            l.add_css('trend', 'a::text')
            l.add_css('number', 'span.small.text-muted::text')
            l.add_value('date', date_str)
            yield l.load_item()
If response.css('title').get() gives you the answer you need, why not use the same CSS with add_css:
l.add_css('date', 'title')
Also, .add_css('date', "response.css('title')") is invalid because the second argument must be a valid CSS selector, not a Python expression.

Scrapy - Selecting and crawling a specific type of sitemap nodes

This is the sitemap of the website I'm crawling. The 3rd and 4th <sitemap> nodes contain the URLs that lead to the item details. Is there any way to apply the crawling logic only to those nodes (for example, by selecting them by their indices)?
class MySpider(SitemapSpider):
    name = 'myspider'
    sitemap_urls = [
        'https://www.dfimoveis.com.br/sitemap_index.xml',
    ]
    sitemap_rules = [
        ('/somehow targeting the 3rd and 4th node', 'parse_item'),
    ]

    def parse_item(self, response):
        # scraping the item
You don't need to use SitemapSpider; just use a regex and a standard spider.
def start_requests(self):
    sitemap = 'https://www.dfimoveis.com.br/sitemap_index.xml'
    yield scrapy.Request(url=sitemap, callback=self.parse_sitemap)

def parse_sitemap(self, response):
    sitemap_links = re.findall(r"<loc>(.*?)</loc>", response.text, re.DOTALL)
    sitemap_links = sitemap_links[2:4]  # Only the 3rd and 4th nodes.
    for sitemap_link in sitemap_links:
        yield scrapy.Request(url=sitemap_link, callback=self.parse)
Scrapy's Spider subclasses, including SitemapSpider, are meant to make very common scenarios very easy.
You want to do something rather uncommon, so you should read the source code of SitemapSpider, try to understand what it does, and either subclass SitemapSpider, overriding the behavior you want to change, or write your own spider from scratch based on its code.
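If you do want to stay with SitemapSpider, recent Scrapy versions (1.8+) provide a sitemap_filter() hook you can override instead of rewriting the spider. A minimal sketch, assuming the wanted <loc> URLs can be recognized by some substring ('TARGET' below is a hypothetical placeholder; this hook does not support filtering by index):

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'myspider'
    sitemap_urls = ['https://www.dfimoveis.com.br/sitemap_index.xml']
    sitemap_rules = [('', 'parse_item')]

    def sitemap_filter(self, entries):
        # Called for every sitemap document (the index and each child
        # sitemap alike), so the pattern must match both the child
        # sitemap URLs you want and the item URLs inside them.
        for entry in entries:
            if 'TARGET' in entry['loc']:
                yield entry

    def parse_item(self, response):
        pass  # scraping the item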

Wrong XPath in IMDB spider (Scrapy)

Here:
IMDB scrapy get all movie data
response.xpath("//*[#class='results']/tr/td[3]")
returns empty list. I tried to change it to:
response.xpath("//*[contains(#class,'chart full-width')]/tbody/tr")
without success.
Any help please? Thanks.
I did not have time to go through IMDB scrapy get all movie data thoroughly, but I got the gist of it. The problem statement is to get all the movie data from the given site. It involves two things: first, going through all the pages that list the movies of a given year; second, getting the link to each movie, after which you do your own magic.
The problem you faced is with getting the XPath for the link to each movie. This is most likely due to a change in the website's structure (I did not have time to verify what the difference might be). Anyway, the following are the XPaths you need.
FIRST:
We take the div with class nav as a landmark and find the lister-page-next next-page class among its children.
response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first()
This gives the link to the next page, or returns None on the last page (since the next-page element is not present).
SECOND:
This is the original question from the OP.
# Get the list of containers holding the title, etc.
containers = response.xpath("//div[@class='lister-item-content']")
# From the containers, extract the required links.
paths = containers.xpath("h3[@class='lister-item-header']/a/@href").extract()
Now all you need to do is loop through each of these paths and request the page.
Thanks for your answer. I eventually used your XPath like so:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import MovieItem

IMDB_URL = "http://imdb.com"

class IMDBSpider(CrawlSpider):
    name = 'imdb'

    # in order to move to the next page
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class='nav']/div/a[@class='lister-page-next next-page']",)),
                  callback="parse_page", follow=True),)

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2017

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year + 1):
            # movies are sorted by number of votes
            yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))

    def parse_page(self, response):
        content = response.xpath("//div[@class='lister-item-content']")
        # list of paths of movies on the current page
        paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract()
        # all movies on this page
        for path in paths:
            item = MovieItem()
            item['MainPageUrl'] = IMDB_URL + path
            request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
            request.meta['item'] = item
            yield request

    # make sure that the start_urls are parsed as well
    parse_start_url = parse_page

    def parse_movie_details(self, response):
        pass  # lots of parsing....
Run it with: scrapy crawl imdb -a start=<start-year> -a end=<end-year>

Scrapy Extract ld+JSON

How to extract the name and url?
quotes_spiders.py
import scrapy
import json

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://www.lazada.com.my/shop-power-banks2/?price=1572-1572"]

    def parse(self, response):
        data = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
        # how to extract the name and url?
        yield data
Data to Extract
<script type="application/ld+json">{"@context":"https://schema.org","@type":"ItemList","itemListElement":[{"@type":"Product","image":"http://my-live-02.slatic.net/p/2/test-product-0601-7378-08684315-8be741b9107b9ace2f2fe68d9c9fd61a-webp-catalog_233.jpg","name":"test product 0601","offers":{"@type":"Offer","availability":"https://schema.org/InStock","price":"99999.00","priceCurrency":"RM"},"url":"http://www.lazada.com.my/test-product-0601-51348680.html?ff=1"}]}</script>
This line of code returns a dictionary with the data you want:
data = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
All you need to do is to access it like:
name = data['itemListElement'][0]['name']
url = data['itemListElement'][0]['url']
Given that the JSON-LD data contains a list, you will need to check that you are referring to the correct product in the list.
A really easy solution for this would be to use https://github.com/scrapinghub/extruct. It handles all the hard parts of extracting structured data.
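A minimal sketch of the extruct approach (assuming extruct is installed, e.g. via pip install extruct; it returns a dict keyed by syntax name):

import extruct
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://www.lazada.com.my/shop-power-banks2/?price=1572-1572"]

    def parse(self, response):
        # Parse only the JSON-LD blocks on the page.
        data = extruct.extract(response.text, base_url=response.url,
                               syntaxes=['json-ld'])
        for block in data['json-ld']:
            for product in block.get('itemListElement', []):
                yield {'name': product.get('name'), 'url': product.get('url')}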

Python Scrapy: passing properties into parser

I'm new to Scrapy and web-scraping in general so this might be a stupid question but it wouldn't be the first time so here goes.
I have a simple Scrapy spider, based on the tutorial example, that processes various URLs (in start_urls). I would like to categorise the URLs, e.g. URLs A, B, and C are Category 1 while URLs D and E are Category 2, and then be able to store the category on the resulting items when the parser processes the response for each URL.
I guess I could have a separate spider for each category, then just hold the category as an attribute on the class so the parser can pick it up from there. But I was kind of hoping I could have just one spider for all the URLs, but tell the parser which category to use for a given URL.
Right now, I'm setting up the URLs in start_urls via my spider's __init__() method. How do I pass the category for a given URL from __init__() to the parser, so that I can record the category on the items generated from the responses for that URL?
As paul t. suggested:
class MySpider(CrawlSpider):

    def start_requests(self):
        ...
        yield Request(url1, meta={'category': 'cat1'}, callback=self.parse)
        yield Request(url2, meta={'category': 'cat2'}, callback=self.parse)
        ...

    def parse(self, response):
        category = response.meta['category']
        ...
You use start_requests to have control over the first URLs you're visiting, attaching metadata to each URL, and you can access that metadata through response.meta afterwards.
The same applies if you need to pass data from a parse function to a parse_item callback, for instance.
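A minimal sketch of that chaining (the URL and the link selector are placeholders for illustration):

import scrapy

class ChainSpider(scrapy.Spider):
    name = 'chain'

    def start_requests(self):
        yield scrapy.Request('https://example.com/list',
                             meta={'category': 'cat1'},
                             callback=self.parse)

    def parse(self, response):
        # Forward the category attached in start_requests to the
        # callback that parses each individual item page.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_item,
                                  meta={'category': response.meta['category']})

    def parse_item(self, response):
        yield {
            'url': response.url,
            'category': response.meta['category'],
        }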
