Scrapy - remove duplicates and output data as a single list? - python

I am using the below code to crawl through multiple links on a page and grab a list of data from each corresponding link:
import scrapy

class testSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://www.website.com']

    def parse(self, response):
        urls = response.css('div.subject_wrapper > a::attr(href)').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.getData)

    def getData(self, response):
        data = {'data': response.css('strong.data::text').extract()}
        yield data
It works fine, but as it's returning a list of data for each link, when I output to CSV it looks like the following:
"dalegribel,Chad,Ninoovcov,dalegribel,Gotenks,sillydog22"
"kaylachic,jmargerum,kaylachic"
"Kempodancer,doctordbrew,Gotenks,dalegribel"
"Gotenks,dalegribel,jmargerum"
...
Is there any simple/efficient way of outputting the data as a single list of rows without any duplicates (the same data can appear on multiple pages), similar to the following?
dalegribel
Chad
Ninoovcov
Gotenks
...
I have tried using an array and then looping over each element to get an output, but I get an error saying yield only supports 'Request, BaseItem, dict or None'. Also, as I would be running this over approximately 10k entries, I'm not sure if storing the data in an array would slow the scrape down too much. Thanks.

Not sure if it can be done using Scrapy's built-in methods, but the Python way would be to keep a set of seen elements, check each scraped value against it, and yield only the unique ones. Membership checks on a set are O(1), so roughly 10k entries won't noticeably slow the crawl:
class testSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://www.website.com']
    unique_data = set()

    def parse(self, response):
        urls = response.css('div.subject_wrapper > a::attr(href)').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.getData)

    def getData(self, response):
        data_list = response.css('strong.data::text').extract()
        for elem in data_list:
            if elem and (elem not in self.unique_data):
                self.unique_data.add(elem)
                yield {'data': elem}
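If you would rather keep the de-duplication out of the spider, the same set-based check can live in an item pipeline instead. The sketch below is only a suggestion under that assumption; the DuplicatesPipeline class name and the settings entry are mine, not from the original post:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    """Hypothetical pipeline: drop any item whose 'data' value has been seen before."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        value = item['data']
        if value in self.seen:
            raise DropItem("Duplicate value: {}".format(value))
        self.seen.add(value)
        return item

# settings.py (module path is an assumption about your project layout):
# ITEM_PIPELINES = {'myproject.pipelines.DuplicatesPipeline': 300}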

Related

List elements retrieved by Xpath in scrapy do not output correctly item by item(for,yield)

I output the URL of the first page of each exhibitor's order-results pages (extracted from a specific EC site) to a CSV file, read that file in start_requests, and loop through it with a for statement.
Each order result page contains information on 30 products.
https://www.buyma.com/buyer/2597809/sales_1.html
I collect the links to the 30 items on each order-results page as a list and try to retrieve them one by one and store them in the item, as shown in the code below, but it does not work.
class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy2'
    allowed_domains = ['www.buyma.com']

    def start_requests(self):
        with open('/Users/morni/researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                for n in range(1, 300):
                    url = str((row[2])[:-5] + '/sales_' + str(n) + '.html')
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        dont_filter=True
                    )

    def parse_firstpage_item(self, response):
        loader = ItemLoader(item=ResearchtoolItem(), response=response)
        Conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_name = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()').getall()
        product_URL = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()
        for i in range(30):
            loader.add_value("Conversion_date", Conversion_date[i])
            loader.add_value("product_name", product_name[i])
            loader.add_value("product_URL", product_URL[i])
            yield loader.load_item()
The output is as follows: each yielded item contains the information for several products at once.
Current status:
{"product_name": ["product1", "product2"]), "Conversion_date":["Conversion_date1", "Conversion_date2" ], "product_URL":["product_URL1", "product_URL2"]},
Ideal:
[{"product_name": "product1", "Conversion_date": Conversion_date1", "product_URL": "product_URL1"},{"product_name": "product2", "Conversion_date": Conversion_date2", "product_URL": "product_URL2"}]
This may be due to my lack of understanding of basic for statements and yield.
You need to create a new loader on each iteration:
for i in range(30):
    loader = ItemLoader(item=ResearchtoolItem(), response=response)
    loader.add_value("Conversion_date", Conversion_date[i])
    loader.add_value("product_name", product_name[i])
    loader.add_value("product_URL", product_URL[i])
    yield loader.load_item()
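Since the three lists come from the same table rows, iterating with zip() instead of a hardcoded range(30) also avoids an IndexError on pages with fewer than 30 products. This is just a possible variant, not part of the original answer:

for date, name, url in zip(Conversion_date, product_name, product_URL):
    loader = ItemLoader(item=ResearchtoolItem(), response=response)
    loader.add_value("Conversion_date", date)
    loader.add_value("product_name", name)
    loader.add_value("product_URL", url)
    yield loader.load_item()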
EDIT:
add_value appends a value to the field's list. Since the list starts with zero elements, after one append you have a list with one element.
In order to get the values as a string you can use a processor. Example:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class ProductItem(scrapy.Item):
    name = scrapy.Field(output_processor=TakeFirst())
    price = scrapy.Field(output_processor=TakeFirst())

class ExampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://scrapingclub.com/exercise/list_infinite_scroll/']

    def parse(self, response, **kwargs):
        names = response.xpath('//div[@class="card-body"]//h4/a/text()').getall()
        prices = response.xpath('//div[@class="card-body"]//h5//text()').getall()
        length = len(names)
        for i in range(length):
            loader = ItemLoader(item=ProductItem(), response=response)
            loader.add_value('name', names[i])
            loader.add_value('price', prices[i])
            yield loader.load_item()
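Note that TakeFirst() returns the first non-empty value collected for a field, which is why each field comes out as a plain string rather than a one-element list. On recent Scrapy releases the processors have moved to the itemloaders package, so the import may need to change; adjust to whichever version you actually have installed:

# Equivalent import on newer Scrapy versions, where scrapy.loader.processors is deprecated
from itemloaders.processors import TakeFirst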

Scrapy, crawling a dynamic page with multiple pages

For an assignment I am trying to build a spider which is able to fetch data from the "www.kaercher.com" webshop. All the products in the webshop are loaded by an AJAX call. In order to load more products, a button named "show more products" has to be pressed. I managed to fetch the required data from the corresponding URL which is called by the AJAX call.
However, for my assignment, I am supposed to fetch all of the products/pages for a certain product. I've been digging around but I can't find a solution. I suppose I need to do something with "isTruncated": true indicates that more products can be loaded, false means that there are no more products. (FIXED)
Once I manage to fetch the data from all the pages, I need to find a way to fetch the data for a list of products (create a .csv file with multiple Kaercher products; each product has a unique ID which can be seen in the URL, in this case ID 20035386 is for the high pressure washer). (FIXED)
Links:
Webshop: https://www.kaercher.com/nl/webshop/hogedrukreinigers-resultaten.html
High pressure washer: https://www.kaercher.com/nl/webshop/hogedrukreinigers-resultaten.html
API Url (page1): https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL
OLD CODE
Spider file
import scrapy
from krc.items import KrcItem
import json

class KRCSpider(scrapy.Spider):
    name = "krc_spider"
    allowed_domains = ["kaercher.com"]
    start_urls = ['https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL']

    def parse(self, response):
        item = KrcItem()
        data = json.loads(response.text)
        for company in data.get('products', []):
            item["productid"] = company["id"]
            item["name"] = company["name"]
            item["description"] = company["description"]
            item["price"] = company["priceFormatted"]
            yield item
Items file
import scrapy

class KrcItem(scrapy.Item):
    productid = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
NEW CODE
EDIT: 15/08/2019
Thanks to @gangabass I managed to fetch data from all of the product pages. I also managed to fetch the data for the different products listed in a keyword.csv file, which lets me scrape a whole list of products. See below for the new code:
Spider file (.py)
import scrapy
from krc.items import KrcItem
import json
import os
import csv

class KRCSpider(scrapy.Spider):
    name = "krc_spider"
    allowed_domains = ["kaercher.com"]
    start_urls = ['https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page=1&size=8&isocode=nl-NL']

    def start_requests(self):
        """Read keywords from the keywords file and construct the search URL."""
        with open(os.path.join(os.path.dirname(__file__), "../resources/keywords.csv")) as search_keywords:
            for keyword in csv.DictReader(search_keywords):
                search_text = keyword["keyword"]
                url = "https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/{0}?page=1&size=8&isocode=nl-NL".format(search_text)
                # The meta is used to send our search text into the parser as metadata
                yield scrapy.Request(url, callback=self.parse, meta={"search_text": search_text})

    def parse(self, response):
        current_page = response.meta.get("page", 1)
        next_page = current_page + 1
        item = KrcItem()
        data = json.loads(response.text)
        for company in data.get('products', []):
            item["productid"] = company["id"]
            item["name"] = company["name"]
            item["description"] = company["description"]
            item["price"] = company["priceFormatted"].replace("\u20ac", "").strip()
            yield item
        if data["isTruncated"]:
            yield scrapy.Request(
                url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page={page}&size=8&isocode=nl-NL".format(page=next_page),
                callback=self.parse,
                meta={'page': next_page},
            )
Items file (.py)
import scrapy

class KrcItem(scrapy.Item):
    productid = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
    producttype = scrapy.Field()
keywords file (.csv)
keyword,keywordtype
20035386,Hogedrukreiniger
20072956,Floor Cleaner
You can use response.meta to send the current page number between requests:
def parse(self, response):
    current_page = response.meta.get("page", 1)
    next_page = current_page + 1
    item = KrcItem()
    data = json.loads(response.text)
    for company in data.get('products', []):
        item["productid"] = company["id"]
        item["name"] = company["name"]
        item["description"] = company["description"]
        item["price"] = company["priceFormatted"]
        yield item
    if data["isTruncated"]:
        yield scrapy.Request(
            url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page={page}&size=8&isocode=nl-NL".format(page=next_page),
            callback=self.parse,
            meta={'page': next_page},
        )
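As a side note, on Scrapy 1.7 and later the page counter can also be passed through cb_kwargs instead of meta, which keeps meta free for Scrapy's own extensions. A rough sketch of the same pagination step; the callback signature shown is my adaptation, not part of the original answer:

# Hypothetical variant: receive the page number as a callback keyword argument (Scrapy >= 1.7)
def parse(self, response, page=1):
    data = json.loads(response.text)
    # ... yield items exactly as above ...
    if data["isTruncated"]:
        yield scrapy.Request(
            url="https://www.kaercher.com/api/v1/products/search/shoppableproducts/partial/20035386?page={0}&size=8&isocode=nl-NL".format(page + 1),
            callback=self.parse,
            cb_kwargs={"page": page + 1},
        )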

In Scrapy, how do I pass the urls generated in one class to the next class in the script?

The following is my spider's code:
import scrapy

class ProductMainPageSpider(scrapy.Spider):
    name = 'ProductMainPageSpider'
    start_urls = ['http://domain.com/main-product-page']

    def parse(self, response):
        for product in response.css('article.isotopeItem'):
            yield {
                'title': product.css('h3 a::text').extract_first().encode("utf-8"),
                'category': product.css('h6 a::text').extract_first(),
                'img': product.css('figure a img::attr("src")').extract_first(),
                'url': product.css('h3 a::attr("href")').extract_first()
            }

class ProductSecondaryPageSpider(scrapy.Spider):
    name = 'ProductSecondaryPageSpider'
    start_urls = """ URLS IN product['url'] FROM PREVIOUS CLASS """

    def parse(self, response):
        for product in response.css('article.isotopeItem'):
            yield {
                'title': product.css('h3 a::text').extract_first().encode("utf-8"),
                'thumbnail': product.css('figure a img::attr("src")').extract_first(),
                'short_description': product.css('div.summary').extract_first(),
                'description': product.css('div.description').extract_first(),
                'gallery_images': product.css('figure a img.gallery-item ::attr("src")').extract_first()
            }
The first class/part works correctly if I remove the second class/part. It generates my JSON file correctly with the requested items in it. However, the website I need to crawl is a two-parter. It has a product archive page that shows each product as a thumbnail, title, and category (and this info is not on the next page). Then, if you click on one of the thumbnails or titles, you are sent to a single product page with specific info on the product.
There are a lot of products so I would like to pipe (yield?) the urls in product['url'] to the second class as the "start_urls" list. But I simply don't know how to do that. My knowledge doesn't go far enough to even know what I'm missing or what is going wrong so that I can find a solution.
See the start_urls placeholder in the second class for what I want to do.
You don't have to create two spiders for this - you can simply go to the next url and carry over your item i.e.:
def parse(self, response):
    item = MyItem()
    item['name'] = response.xpath("//name/text()").extract()
    next_page_url = response.xpath("//a[@class='next']/@href").extract_first()
    yield Request(next_page_url,
                  self.parse_next,
                  meta={'item': item})  # carry over our item

def parse_next(self, response):
    # get our carried item from response meta
    item = response.meta['item']
    item['description'] = response.xpath("//description/text()").extract()
    yield item
However, if for some reason you really want to split the logic of these two steps, you can simply save the results to a file (a JSON file, for example: scrapy crawl first_spider -o results.json) and open/iterate through it in your second spider's start_requests() method, which would yield the urls, i.e.:
import json
from scrapy import Spider, Request

class MySecondSpider(Spider):
    def start_requests(self):
        # this overrides `start_urls` logic
        with open('results.json', 'r') as f:
            data = json.loads(f.read())
            for item in data:
                yield Request(item['url'])

Scrapy request does not callback

I am trying to create a spider that takes data from a csv (two links and a name per row), and scrapes a simple element (price) from each of those links, returning an item for each row, with the item's name being the name in the csv, and two scraped prices (one from each link).
Everything works as expected except that, instead of the prices that should be returned from each request's callback function, I get a Request object like this:
<GET https://link.com>
The callback functions don't get called at all, why is that?
Here is the spider:
f = open('data.csv')
f_reader = csv.reader(f)
f_data = list(f_reader)

parsed_data = []
for product in f_data:
    product = product[0].split(';')
    parsed_data.append(product)
f.close()

class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['domain1', 'domain2']
    start_urls = ["domain1_but_its_fairly_useless"]

    def parse(self, response):
        global parsed_data
        for product in parsed_data:
            item = Product()
            item['name'] = product[0]
            item['first_price'] = scrapy.Request(product[1], callback=self.parse_first)
            item['second_price'] = scrapy.Request(product[2], callback=self.parse_second)
            yield item

    def parse_first(self, response):
        digits = response.css('.price_info .price span').extract()
        decimals = response.css('.price_info .price .price_demicals').extract()
        yield float(str(digits) + '.' + str(decimals))

    def parse_second(self, response):
        digits = response.css('.lr-prod-pricebox-price .lr-prod-pricebox-price-primary span[itemprop="price"]').extract()
        yield digits
Thanks in advance for your help!
TL;DR: You are yielding an item with Request objects inside of it when you should yield either Item or Request.
Long version:
Parse methods in your spider should yield either a scrapy.Item, in which case the chain for that crawl stops and Scrapy outputs the item, or a scrapy.Request, in which case Scrapy schedules a request to continue the chain.
Scrapy is asynchronous, so creating an item from multiple requests means chaining those requests while carrying your item through every one of them and filling it up little by little.
The Request object has a meta attribute where you can store (pretty much) anything you want, and it will be carried over to your callback function. It is very common to use it to chain requests for items that require multiple requests to form a single item.
Your spider should look something like this:
class ProductSpider(scrapy.Spider):
    # <...>
    def parse(self, response):
        for product in parsed_data:
            item = Product()
            item['name'] = product[0]
            # carry the next url you want to crawl in meta
            # and carry your item in meta
            yield Request(product[1], self.parse_first,
                          meta={"product3": product[2], "item": item})

    def parse_first(self, response):
        # retrieve the item that you created in parse()
        item = response.meta['item']
        # fill it up
        digits = response.css('.price_info .price span').extract()
        decimals = response.css('.price_info .price .price_demicals').extract()
        item['first_price'] = float(str(digits) + '.' + str(decimals))
        # retrieve the next url from meta
        # and carry your item over to the next request
        yield Request(response.meta['product3'], self.parse_second,
                      meta={"item": item})

    def parse_second(self, response):
        # again, retrieve your item
        item = response.meta['item']
        # fill it up
        digits = response.css('.lr-prod-pricebox-price .lr-prod-pricebox-price-primary span[itemprop="price"]').extract()
        item['second_price'] = digits
        # and finally return the item after 3 requests!
        yield item

Can scrapy yield both request and items?

When I write parse() function, can I yield both a request and items for one single page?
I want to extract some data in page A and then store the data in database, and extract links to be followed (this can be done by rule in CrawlSpider).
I call the pages linked from the A pages the B pages, so I can write another parse_item() to extract data from the B pages. But I also want to extract some links from the B pages, so can I only use a Rule to extract links? And how do I deal with duplicate URLs in Scrapy?
Yes, you can yield both requests and items. From what I've seen:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)
    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)
    for item in self.parse2(response):
        yield item
I'm not 100% sure I understand your question, but the code below requests sites from a starting url using BaseSpider, scans the starting url for hrefs, and then loops over each link calling parse_url. Everything matched in parse_url is sent to your item pipeline.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()  ## only grab urls with "content" in the url name
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()  ## this grabs it
    return item
From Steven Almeroth on Google Groups:
You are right, you can yield Requests and return a list of Items, but that is not what you are attempting. You are attempting to yield a list of Items instead of returning them. And since you are already using parse() as a generator function you cannot have both yield and return together. But you can have many yields.
Try this:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)
    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)
    for item in self.parse2(response):
        yield item
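On the duplicate-URL part of the question: Scrapy's scheduler already drops duplicate requests by default via its request-fingerprint dupefilter, so yielding the same URL twice normally results in a single crawl. You only need dont_filter=True when you deliberately want a URL re-fetched, for example:

yield Request(url, callback=self.parse2)                    # silently dropped if url was already seen
yield Request(url, callback=self.parse2, dont_filter=True)  # always scheduled, bypasses the dupefilter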
