I wrote a spider in Scrapy and I'd like to scrape only new items. I know about DeltaFetch, but I don't think it fits my case.
I have a list of URLs. For each URL, I have items with different values, and I'd like to update only those items.
My spider:
import json
import scrapy
import re
import pkgutil
from scrapy.loader import ItemLoader
from auctions_results.items import AuctionItem
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
from datetime import datetime


class GlenMarchSpider(scrapy.Spider):
    name = 'auctions_results'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        data_file = pkgutil.get_data(
            "auctions_results", "json/input/scrape_db_complete.json")
        self.data = json.loads(data_file)

    def start_requests(self):
        for item in self.data:
            request = scrapy.Request(item['gm_url'], callback=self.parse)
            request.meta['item'] = item
            yield request

    def parse(self, response):
        item = response.meta['item']
        item['results'] = []
        for caritem in response.css("div.car-item-border"):
            data = AuctionItem()
            data["auction_house"] = caritem.css("div.auctionHouse::text").extract_first().split("-", 1)[0].strip()
            data["auction_url"] = caritem.css("div.view-auction a::attr(href)").extract_first()
            data["image_urls"] = caritem.css("div.view-auction a img::attr(src)").extract_first()
            data["image_cloud"] = None
            item['results'].append(data)
        yield item
I have a pipeline for images:
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class DownloadImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for result in item['results']:
            image_url = result['image_urls']
            if image_url is not None:
                request = scrapy.Request(url=image_url)
                yield request
I'd like to update only the new items in results, including their images.
Output sample:
"gm_url": "https://www.sitetoscrape.com/page",
"results": [
{
"auction_house": "Artcurial",
"auction_url": null,
"image_urls": "https://www.example.com/img.jpg",
"image_cloud": "https://res.cloudinary.com/img.jpg"
},
{
"auction_house": "Christies",
"auction_url": "http://www.example.com",
"image_urls": "https://www.example.com/img2.jpg",
"image_cloud": "https://res.cloudinary.com/img2.jpg"
}
],
"images": [
{
"url": "https://www.example.com/img.jpg",
"path": "full/img.jpg",
"checksum": "212a6287eed95943a1d51ebc662c44be"
},
{
"url": "https://www.example.com/img2.jpg",
"path": "full/img2.jpg",
"checksum": "ff51dc0a36747bdf022feb83c19a3109"
}
]
} ...
How can I do that?
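One possible approach, sketched below as an assumption rather than a ready-made answer: add a pipeline ahead of DownloadImagesPipeline that loads the previous run's output and drops every result it has already seen, so only new results (and therefore only new images) go through. The file name previous_run.json and keying each result on image_urls are placeholders; swap in your own output file and whatever uniquely identifies a result for you.

import json
import os


class OnlyNewResultsPipeline(object):
    """Drop results already present in a previous run's output (sketch)."""

    def open_spider(self, spider):
        # map gm_url -> set of image_urls seen in the previous run
        self.seen = {}
        if os.path.exists('previous_run.json'):
            with open('previous_run.json') as f:
                for old in json.load(f):
                    self.seen[old['gm_url']] = {
                        r.get('image_urls') for r in old.get('results', [])
                    }

    def process_item(self, item, spider):
        already = self.seen.get(item['gm_url'], set())
        # keep only the results this spider has not produced before
        item['results'] = [r for r in item['results']
                           if r['image_urls'] not in already]
        return item

Enable it in ITEM_PIPELINES with a lower order number than DownloadImagesPipeline so the filtering happens before the image downloads.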
I'm using a pipeline in Scrapy to write the scraped results to a JSON file. The pipeline places a comma after each scraped item; however, I want to drop the comma for the last item. Is there a way to do that?
This is the pipeline:
import json


class ExamplePipeline(object):
    def open_spider(self, spider):
        self.file = open('example.json', 'w')
        self.file.write("[")

    def close_spider(self, spider):
        self.file.write("]")
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(
            dict(item),
            indent=4,
            sort_keys=True,
            separators=(',', ': ')
        ) + ",\n"
        self.file.write(line)
        return item
And the sample output looks like:
[
    {
        "item1": "example",
        "item2": "example"
    },
    {
        "item1": "example",
        "item2": "example"
    },
]
What is the Python way to find the last item and not give it a comma separator? I thought I could do something like if item[-1], but I can't get that working.
Any ideas?
To apply this to your pipeline, you'll have to seek back in your file and delete that comma:
See related Python - Remove very last character in file
import os


class ExamplePipeline(object):
    def close_spider(self, spider):
        # go back 2 characters: \n and ,
        self.file.seek(-2, os.SEEK_END)
        # cut trailing data
        self.file.truncate()
        # save
        self.file.write("]")
        self.file.close()
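Note that in Python 3 a file opened in text mode does not allow non-zero end-relative seeks, so the seek-and-truncate trick only works if the file is opened in binary mode (and bytes are written). An alternative sketch that avoids seeking altogether is to write the separator before every item except the first:

import json


class ExamplePipeline(object):
    def open_spider(self, spider):
        self.file = open('example.json', 'w')
        self.file.write("[")
        self.first_item = True

    def process_item(self, item, spider):
        # write the comma *before* every item except the first,
        # so there is never a trailing comma to clean up
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(",\n")
        self.file.write(json.dumps(dict(item), indent=4, sort_keys=True))
        return item

    def close_spider(self, spider):
        self.file.write("]")
        self.file.close()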
Just a quick question about json export formatting in Scrapy. My exported file looks like this.
{"pages": {"title": "x", "text": "x", "tags": "x", "url": "x"}}
{"pages": {"title": "x", "text": "x", "tags": "x", "url": "x"}}
{"pages": {"title": "x", "text": "x", "tags": "x", "url": "x"}}
But I would like it to be in this exact format. Somehow I need to get all the other information under "pages".
{"pages": [
{"title": "x", "text": "x", "tags": "x", "url": "x"},
{"title": "x", "text": "x", "tags": "x", "url": "x"},
{"title": "x", "text": "x", "tags": "x", "url": "x"}
]}
I'm not very experienced with Scrapy or Python, but I have everything else in my spider working except the export format. This is my pipelines.py, which I just got working.
from scrapy.exporters import JsonItemExporter
import json


class RautahakuPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
These are the items in my spider.py that I need to extract:
items = []
for title, text, tags, url in zip(product_title, product_text, product_tags, product_url):
    item = TechbbsItem()
    item['pages'] = {}
    item['pages']['title'] = title
    item['pages']['text'] = text
    item['pages']['tags'] = tags
    item['pages']['url'] = url
    items.append(item)
return items
Any help is greatly appreciated, as this is the last obstacle in my project.
EDIT
items = {'pages': [{'title': title, 'text': text, 'tags': tags, 'url': url}
                   for title, text, tags, url in zip(product_title, product_text, product_tags, product_url)]}
This exports the .json in this format:
{"pages": [{"title": "x", "text": "x", "tags": "x", "url": "x"}]}
{"pages": [{"title": "x", "text": "x", "tags": "x", "url": "x"}]}
{"pages": [{"title": "x", "text": "x", "tags": "x", "url": "x"}]}
This is getting closer, but I still need just one "pages" at the start of the file, with everything else inside an array under it.
EDIT 2
I think my spider.py is the reason why "pages" gets added to every line in the .json file, and I should have posted its whole code originally. Here it is.
# -*- coding: utf-8 -*-
import scrapy
from urllib.parse import urljoin


class TechbbsItem(scrapy.Item):
    pages = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()
    tags = scrapy.Field()
    url = scrapy.Field()


class TechbbsSpider(scrapy.Spider):
    name = 'techbbs'
    allowed_domains = ['bbs.io-tech.fi']
    start_urls = [
        'https://bbs.io-tech.fi/forums/prosessorit-emolevyt-ja-muistit.73/?prefix_id=1'  # This is a list page full of used pc-part listings
    ]

    def parse(self, response):  # This visits product links in the product list page
        links = response.css('a.PreviewTooltip::attr(href)').extract()
        for l in links:
            url = response.urljoin(l)
            yield scrapy.Request(url, callback=self.parse_product)
        next_page_url = response.xpath('//a[contains(.,"Seuraava ")]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_product(self, response):  # This extracts data from inside the links
        product_title = response.xpath('normalize-space(//h1/span/following-sibling::text())').extract()
        product_text = response.xpath('//b[contains(.,"Hinta:")]/following-sibling::text()[1]').re('([0-9]+)')
        tags = "tags"  # This is just a placeholder
        product_tags = tags
        product_url = response.xpath('//html/head/link[7]/@href').extract()
        items = []
        for title, text, tags, url in zip(product_title, product_text, product_tags, product_url):
            item = TechbbsItem()
            item['pages'] = {}
            item['pages']['title'] = title
            item['pages']['text'] = text
            item['pages']['tags'] = tags
            item['pages']['url'] = url
            items.append(item)
        return items
So my spider starts crawling from a page full of product listings. It visits each of the 50 product links and scrapes four fields: title, text, tags and url. After scraping every link on one page, it moves to the next page, and so on. I suspect the loops in the code prevent your suggestions from working for me.
I would like to get the .json export into the exact form mentioned in the original question. So there would be {"pages": [ at the beginning of the file, then all the indented item lines
{"title": "x", "text": "x", "tags": "x", "url": "x"}, and at the end ]}
In terms of memory usage, it's not a good practice, but an option is to keep an object and write it at the end of the process:
class RautahakuPipeline(object):
    def open_spider(self, spider):
        self.items = {"pages": []}
        self.file = None  # opened later, in close_spider

    def close_spider(self, spider):
        self.file = open('items.json', 'w')
        self.file.write(json.dumps(self.items))
        self.file.close()

    def process_item(self, item, spider):
        self.items["pages"].append(dict(item))
        return item
Then, if memory is an issue (it needs care either way), try writing the JSON file as follows:
class RautahakuPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.json', 'w')
        header = '{"pages": ['
        self.file.write(header)

    def close_spider(self, spider):
        footer = ']}'
        self.file.write(footer)
        self.file.close()

    def process_item(self, item, spider):
        # note: as written the items end up only newline-separated; valid JSON
        # still needs commas between them (see the sketch after this answer)
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
I hope it helps.
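If you want the streaming variant to produce exactly the {"pages": [...]} shape from the question and stay valid JSON, a sketch that combines it with a first-item flag could look like this. It assumes each item still arrives wrapped as {'pages': {...}}, as in the original spider:

import json


class RautahakuPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.json', 'w')
        self.file.write('{"pages": [\n')
        self.first_item = True

    def process_item(self, item, spider):
        # unwrap the inner dict so only it lands inside the "pages" array
        page = dict(item)['pages']
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        self.file.write(json.dumps(page))
        return item

    def close_spider(self, spider):
        self.file.write('\n]}')
        self.file.close()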
Using a list comprehension. I don't know what your data looks like, but using a toy example:
product_title = range(1, 10)
product_text = range(10, 20)
product_tags = range(20, 30)
product_url = range(30, 40)

item = {'pages': [{'title': title, 'text': text, 'tags': tags, 'url': url}
                  for title, text, tags, url in zip(product_title, product_text, product_tags, product_url)]}
I get this result:
{'pages': [{'tags': 20, 'text': 10, 'title': 1, 'url': 30},
{'tags': 21, 'text': 11, 'title': 2, 'url': 31},
{'tags': 22, 'text': 12, 'title': 3, 'url': 32},
{'tags': 23, 'text': 13, 'title': 4, 'url': 33},
{'tags': 24, 'text': 14, 'title': 5, 'url': 34},
{'tags': 25, 'text': 15, 'title': 6, 'url': 35},
{'tags': 26, 'text': 16, 'title': 7, 'url': 36},
{'tags': 27, 'text': 17, 'title': 8, 'url': 37},
{'tags': 28, 'text': 18, 'title': 9, 'url': 38}]}
items = {}
# item = TechbbsItem()  # not sure what this is doing?
items['pages'] = []
for title, text, tags, url in zip(product_title, product_text, product_tags, product_url):
    temp_dict = {}
    temp_dict['title'] = title
    temp_dict['text'] = text
    temp_dict['tags'] = tags
    temp_dict['url'] = url
    items["pages"].append(temp_dict)
return items
I am new to Python and Scrapy, and I am trying to create a JSON file from three levels of nested pages. I have the following structure:
Page 1 (start): contains links to the second page (called Mangas)
Page 2: contains the nested Volumes and Chapters
Page 3: each Chapter contains multiple images
My code:
import scrapy
import time
import items
import json


class GmangaSpider(scrapy.Spider):
    name = "gmanga"
    start_urls = [
        "http://gmanga.me/mangas"
    ]

    def parse(self, response):
        # mangas = []
        for manga in response.css('div.manga-item'):
            link = manga.css('a.manga-item-content').xpath('@href').extract_first()
            if link:
                page_link = "http://gmanga.me%s" % link
                mangas = items.Manga()
                mangas['cover'] = manga.css('a.manga-item-content .manga-cover-container img').xpath('@src').extract_first()
                mangas['title'] = manga.css('a.manga-item-content .manga-cover-container img').xpath('@alt').extract_first()
                mangas['link'] = page_link
                mangas['volumes'] = []
                yield scrapy.Request(page_link, callback=self.parse_volumes, meta={"mangas": mangas})

    def parse_volumes(self, response):
        mangas = response.meta['mangas']
        for manga in response.css('div.panel'):
            volume = items.Volume()
            volume['name'] = manga.css('div.panel-heading .panel-title a::text').extract_first()
            volume['chapters'] = []
            for tr in manga.css('div.panel-collapse .panel-body table tbody tr'):
                chapter = items.Chapter()
                chapter['name'] = tr.css('td:nth_child(1) div::text').extract_first()
                chapter_link = tr.css('td:nth_child(3) a::attr("href")').extract_first()
                chapter['link'] = chapter_link
                request = scrapy.Request("http://gmanga.me%s" % chapter_link, callback=self.parse_images, meta={"chapter": chapter})
                yield request
                volume['chapters'].append(chapter)
            mangas['volumes'].append(volume)
        yield mangas

    def parse_images(self, response):
        chapter = response.meta['chapter']
        data = response.xpath("//script").re(r"alphanumSort\((.*])")
        if data:
            images = json.loads(data[0])
            chapter['images'] = images
        return chapter
My items.py:
from scrapy import Item, Field


class Manga(Item):
    title = Field()
    cover = Field()
    link = Field()
    volumes = Field()


class Volume(Item):
    name = Field()
    chapters = Field()


class Chapter(Item):
    name = Field()
    images = Field()
    link = Field()
Now I am a bit confused about where to yield or return in the parse_volumes function to get the following structure in the JSON file.
Expected Result:
[{
    "cover": "http://media.gmanga.me/uploads/manga/cover/151/medium_143061.jpg",
    "link": "http://gmanga.me/mangas/gokko",
    "volumes": [{
        "name": "xyz",
        "chapters": [{
            "link": "/mangas/gokko/4/3asq",
            "name": "4",
            "images": ["img1.jpg", "img2.jpg"]
        }, {
            "link": "/mangas/gokko/3/3asq",
            "name": "3",
            "images": ["img1.jpg", "img2.jpg"]
        }]
    }],
    "title": "Gokko"
}]
But I am getting the images node as a separate item; it should be inside the chapters node of each volume:
[{
    "cover": "http://media.gmanga.me/uploads/manga/cover/10581/medium_I2.5HFzVh7e.png",
    "link": "http://gmanga.me/mangas/godess-creation-system",
    "volumes": [{
        "name": "\u0627\u0644\u0645\u062c\u0644\u062f ",
        "chapters": [{
            "link": "/mangas/godess-creation-system/1/ayou-cahn",
            "name": "1"
        }]
    }],
    "title": "Godess Creation System"
},
{
    "images": ["http://media.gmanga.me/uploads/releases/lolly-pop/047-20160111235059UXYGJACW/01.jpg?ak=p0skml", "http://media.gmanga.me/uploads/releases/lolly-pop/047-20160111235059UXYGJACW/02.jpg?ak=p0skml", "http://media.gmanga.me/uploads/releases/lolly-pop/047-20160111235059UXYGJACW/03.jpg?ak=p0skml", "http://media.gmanga.me/uploads/releases/lolly-pop/047-20160111235059UXYGJACW/04.jpg?ak=p0skml"],
    "link": "/mangas/reversal/1/Lolly-Pop",
    "name": "1"
}]
Each function fetches its data properly on its own; the only issue is forming the JSON. It is not written to the JSON file in the right shape. Please point out where I am going wrong.
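Not a definitive fix, but one common pattern for this kind of nesting is to yield the Manga item only after every chapter request has come back, by counting the outstanding chapter pages. The sketch below assumes you add an extra pending = Field() to Manga (or track the counter somewhere else, such as a plain dict on the spider) and that chapter URLs are unique, since a filtered or failed request would leave the counter above zero:

    def parse_volumes(self, response):
        mangas = response.meta['mangas']
        chapter_requests = []
        for manga in response.css('div.panel'):
            volume = items.Volume()
            volume['name'] = manga.css('div.panel-heading .panel-title a::text').extract_first()
            volume['chapters'] = []
            for tr in manga.css('div.panel-collapse .panel-body table tbody tr'):
                chapter = items.Chapter()
                chapter['name'] = tr.css('td:nth_child(1) div::text').extract_first()
                chapter_link = tr.css('td:nth_child(3) a::attr("href")').extract_first()
                chapter['link'] = chapter_link
                volume['chapters'].append(chapter)
                chapter_requests.append(scrapy.Request(
                    "http://gmanga.me%s" % chapter_link,
                    callback=self.parse_images,
                    meta={"mangas": mangas, "chapter": chapter}))
            mangas['volumes'].append(volume)

        mangas['pending'] = len(chapter_requests)  # hypothetical helper field
        if not chapter_requests:
            yield mangas  # no chapters at all: nothing to wait for
        for request in chapter_requests:
            yield request

    def parse_images(self, response):
        chapter = response.meta['chapter']
        mangas = response.meta['mangas']
        data = response.xpath("//script").re(r"alphanumSort\((.*])")
        if data:
            chapter['images'] = json.loads(data[0])
        mangas['pending'] -= 1
        if mangas['pending'] == 0:
            del mangas['pending']  # drop the helper counter before export
            yield mangas  # the chapter dicts inside are already filled in

Because the Chapter instances appended to volume['chapters'] are the same objects passed through meta, filling chapter['images'] in parse_images also fills them inside the Manga item, so the single yield at the end produces the nested structure you expect.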
I'm learning how to work with Scrapy while refreshing my Python and coding knowledge from school.
Currently, I'm playing around with the IMDb Top 250 list, but I'm struggling with the JSON output file.
My current code is:
# -*- coding: utf-8 -*-
import scrapy
from top250imdb.items import Top250ImdbItem


class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()
        final_list.append(item)
        updated_list = []
        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
        return updated_list
and my output file gives me this JSON structure:
[
    {
        "movie": {
            "poster": ["https://images-na.ssl-images-amazon.com/poster..."],
            "title": ["The Shawshank Redemption"],
            "photo": [["https://images-na.ssl-images-amazon.com/photo..."]],
            "actors": [["Tim Robbins", "Morgan Freeman", ...]]}
    }, {
        "movie": {
            "poster": ["https://images-na.ssl-images-amazon.com/poster..."],
            "title": ["The Godfather"],
            "photo": [["https://images-na.ssl-images-amazon.com/photo..."]],
            "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen", ...]]}
    }
]
but I'm looking to achieve this:
{
    "movies": [{
        "poster": "https://images-na.ssl-images-amazon.com/poster...",
        "title": "The Shawshank Redemption",
        "actors": [
            {"photo": "https://images-na.ssl-images-amazon.com/photo...",
             "name": "Tim Robbins"},
            {"photo": "https://images-na.ssl-images-amazon.com/photo...",
             "name": "Morgan Freeman"},...
        ]
    }, {
        "poster": "https://images-na.ssl-images-amazon.com/poster...",
        "title": "The Godfather",
        "actors": [
            {"photo": "https://images-na.ssl-images-amazon.com/photo...",
             "name": "Marlon Brando"},
            {"photo": "https://images-na.ssl-images-amazon.com/photo...",
             "name": "Al Pacino"},...
        ]
    }]
}
in my items.py file I have the following:
import scrapy


class Top250ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Items from actors.py
    poster = scrapy.Field()
    title = scrapy.Field()
    photo = scrapy.Field()
    actors = scrapy.Field()
    movie = scrapy.Field()
I'm aware of the following things:
My results don't come out in order; the first movie on the web page's list is always the first movie in my output file, but the rest are not. I'm still working on that.
I could do the same thing working with Top250ImdbItem(); I'm still reading up on how to do that in more detail.
This might not be the perfect layout for my JSON, and suggestions are welcome; or if it is, let me know, even though I know there is no perfect or "only" way.
Some actors don't have a photo, and the page then uses a different CSS selector. For now, I'd rather not reach for the "no picture" thumbnail, so it's OK to leave those entries empty.
example:
{"photo": "", "name": "Al Pacino"}
Question: ... struggling with a JSON output file
Note: I can't run your ActorsSpider; I get the error "Pseudo-elements are not supported".
# Define a dict once
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    poster = response.css(...
    title = response.css(...
    photos = response.css(...
    actors = response.css(...

    # Assuming the list of actors is in sync with the list of photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list
                 }

    # Append one movie to the Top250 'movies' list
    top250ImdbItem['movies'].append(one_movie)
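One thing still missing from this sketch is actually writing top250ImdbItem out. Assuming it lives on the spider as self.top250ImdbItem, a minimal option is to dump it once the crawl finishes, in the spider's closed() callback (the file name movies.json is just a placeholder):

import json
import scrapy


class ActorsSpider(scrapy.Spider):
    name = "actors"
    # ... parse / parse_actor as above, appending into self.top250ImdbItem ...

    def closed(self, reason):
        # called once when the spider finishes
        with open('movies.json', 'w') as f:
            json.dump(self.top250ImdbItem, f, indent=4)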