I am trying to scrape information about biblical commentaries from a website. Below is the code I have written to do so. start_urls is the link to the JSON file I am trying to scrape. I chose ['0']['father']['_id'] to get the name of the commenter; however, the following error occurs. What should I do?
Error: TypeError: list indices must be integers or slices, not str
Code:
import scrapy
import json

class catenaspider(scrapy.Spider):  # spider to crawl the url
    name = 'commentary'  # name to be called in command terminal
    start_urls = ['https://api.catenabible.com:8080/anc_com/c/mt/1/1?tags=[%22ALL%22]&sort=def']

    def parse(self, response):
        data = json.loads(response.body)
        yield from data['0']['father']['_id']
Read the documentation again. Scrapy responses have a built-in .json() method, and the top-level JSON value here is a list, so it has to be indexed with an integer:
import scrapy

class catenaspider(scrapy.Spider):  # spider to crawl the url
    name = 'commentary'  # name to be called in command terminal
    start_urls = ['https://api.catenabible.com:8080/anc_com/c/mt/1/1?tags=[%22ALL%22]&sort=def']

    def parse(self, response):
        data = response.json()
        yield {'id_father': data[0]['father']['_id']}

        # if you want to get all the ids:
        # for d in data:
        #     yield {'id_father': d['father']['_id']}
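A quick illustration of why the original indexing failed: json.loads and response.json() turn a top-level JSON array into a Python list, which cannot be indexed with the string '0' (the '_id' value below is invented):

    >>> data = [{'father': {'_id': 'john-chrysostom'}}]
    >>> data['0']['father']['_id']
    TypeError: list indices must be integers or slices, not str
    >>> data[0]['father']['_id']
    'john-chrysostom'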
Related
I am trying to get a JSON field with the key "longName" with Scrapy, but I am receiving the error: "Spider must return request, item, or None, got 'str'".
The JSON I'm trying to scrape looks something like this:
{
    "id": 5355,
    "code": 9594
}
This is my code:
import scrapy
import json

class NotesSpider(scrapy.Spider):
    name = 'notes'
    allowed_domains = ['blahblahblah.com']
    start_urls = ['https://blahblahblah.com/api/123']

    def parse(self, response):
        data = json.loads(response.body)
        yield from data['longName']
I get the above error when I run "scrapy crawl notes" at the prompt. Can anyone point me in the right direction?
If you only want longName, modifying your parse method like this should do the trick:
def parse(self, response):
    data = json.loads(response.body)
    yield {"longName": data["longName"]}
I'm new to Python and I mixed together many example codes for my Scrapy spider, but it just shows me the first page's data for all pages. What is the problem?
My code is:
import scrapy
from scrapy.item import Item, Field

class HotelAbbasiItem(Item):
    reviewer = Field()
    DateOfReview = Field()
    Nationality = Field()
    Contribution = Field()
    ReviewText = Field()
    Rating = Field()

class HotelabbasiSpider(scrapy.Spider):
    name = 'HotelAbbasi'
    allowed_domains = ['tripadvisor.com']
    start_urls = ['https://www.tripadvisor.com/Hotel_Review-g295423-d320767-Reviews-Abbasi_Hotel-Isfahan_Isfahan_Province.html']

    def parse(self, response):
        items = HotelAbbasiItem()
        all_div_parts = response.css('div.hotels-community-tab-common-Card__section--4r93H')
        for part in all_div_parts:
            reviewer = part.css('a.social-member-event-MemberEventOnObjectBlock__member--35-jC::text').extract()
            DateOfReview = part.css('span::text').extract()
            Nationality = part.css('span.small::text').extract()
            Contribution = part.css('span.social-member-MemberHeaderStats__bold--3z3qh::text').extract()
            ReviewText = part.css('q.location-review-review-list-parts-ExpandableReview__reviewText--gOmRC>span::text').extract()
            Rating = part.css('div.location-review-review-list-parts-RatingLine__bubbles--GcJvM>span::attr(class)').extract()

            items['reviewer'] = reviewer
            items['DateOfReview'] = DateOfReview
            items['Nationality'] = Nationality
            items['Contribution'] = Contribution
            items['ReviewText'] = ReviewText
            items['Rating'] = Rating
            yield items

        NextPage = response.css('div.is-centered>a.primary::attr(href)').extract_first()
        if NextPage:
            NextPage = response.urljoin(NextPage)
            yield scrapy.Request(url=NextPage, callback=self.parse)
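One thing worth checking (an assumption, since this hasn't been verified against the live site): TripAdvisor renders much of its review list and pagination with JavaScript, so the href extracted from the raw HTML may resolve back to the same page, which would produce exactly this symptom. Logging the URL being followed makes that easy to verify:

    NextPage = response.css('div.is-centered>a.primary::attr(href)').extract_first()
    if NextPage:
        NextPage = response.urljoin(NextPage)
        self.logger.info('Following next page: %s', NextPage)  # confirm this URL actually changes
        yield scrapy.Request(url=NextPage, callback=self.parse)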
I am trying to extract and save the images, but every time I run the spider I get this error. I have defined the following fields in items.py.
import scrapy
from ..items import HamrobazarItem

class CarsSpider(scrapy.Spider):
    name = 'cars'
    start_urls = ['https://hamrobazaar.com/c48-automobiles-cars']

    def parse(self, response):
        items = HamrobazarItem()
        img_urls = list()
        img_urls.append(response.css('center img::attr(src)').extract())
        items['image_urls'] = img_urls
        yield items
import scrapy

class HamrobazarItem(scrapy.Item):
    images = scrapy.Field()
    image_urls = scrapy.Field()
I couldn't run your spider, but the problem seems to be that you are yielding a list of lists: response.css('center img::attr(src)').extract() already returns a list, so img_urls.append(response.css('center img::attr(src)').extract()) wraps that list inside another one. Assigning it directly with img_urls = response.css('center img::attr(src)').extract() may solve your problem.
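Put together, a sketch of the corrected parse method (same selector as the original):

    def parse(self, response):
        items = HamrobazarItem()
        # extract() already returns a list of URL strings, so assign it directly
        items['image_urls'] = response.css('center img::attr(src)').extract()
        yield items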
I need to scrape this site.
It looks like it is made in React, so I tried to extract the data with scrapy-splash. I need, for example, the "a" element with the class shelf-product-name, but the response is an empty array even with the wait argument set to about 5 seconds.
from scrapy_splash import SplashRequest  # import needed for SplashRequest

def start_requests(self):
    yield SplashRequest(
        url='https://www.jumbo.cl/lacteos-y-bebidas-vegetales/leches-blancas?page=6',
        callback=self.parse,
        args={'wait': 5},
    )

def parse(self, response):
    print(response.css("a.shelf-product-name"))
Actually there is no need to use Scrapy Splash, because all the required data is stored inside a <script> tag of the raw HTML response as JSON-formatted data:
import scrapy
from scrapy.crawler import CrawlerProcess
import json

class JumboCLSpider(scrapy.Spider):
    name = "JumboCl"
    start_urls = ["https://www.jumbo.cl/lacteos-y-bebidas-vegetales/leches-blancas?page=6"]

    def parse(self, response):
        script = [script for script in response.css("script::text") if "window.__renderData" in script.extract()]
        if script:
            script = script[0]
            data = script.extract().split("window.__renderData = ")[-1]
            json_data = json.loads(data[:-1])
            for plp in json_data["plp"]["plp_products"]:
                for product in plp["data"]:
                    # yield {"productName": product["productName"]}  # data from css: a.shelf-product-name
                    yield product

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(JumboCLSpider)
    c.start()
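One caveat: the data[:-1] slice assumes the JavaScript assignment ends with exactly one trailing semicolon. A slightly more defensive variant of the same idea strips whitespace and the semicolon explicitly:

    payload = script.extract().split("window.__renderData = ")[-1]
    json_data = json.loads(payload.rstrip().rstrip(';'))  # tolerate whitespace after the semicolon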
I've written a script in Python Scrapy to parse the names and prices of different items available on a webpage. I tried to implement the logic in my script the way I've learnt so far. However, when I execute it, I get the following error; I suppose I can't make the callback method work properly. Here is the script I've tried:
The spider, named "sth.py", contains:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http.request import Request

class SephoraSpider(CrawlSpider):
    name = "sephorasp"

    def start_requests(self):
        yield Request(url="https://www.sephora.ae/en/stores/", callback=self.parse_pages)

    def parse_pages(self, response):
        for link in response.xpath('//ul[@class="nav-primary"]//a[contains(@class,"level0")]/@href').extract():
            yield Request(url=link, callback=self.parse_inner_pages)

    def parse_inner_pages(self, response):
        for links in response.xpath('//li[contains(@class,"amshopby-cat")]/a/@href').extract():
            yield Request(url=links, callback=self.target_page)

    def target_page(self, response):
        for titles in response.xpath('//div[@class="product-info"]'):
            product = titles.xpath('.//div[contains(@class,"product-name")]/a/text()').extract_first()
            rate = titles.xpath('.//span[@class="price"]/text()').extract_first()
            yield {'Name': product, 'Price': rate}
"items.py" contains:
import scrapy

class SephoraItem(scrapy.Item):
    Name = scrapy.Field()
    Price = scrapy.Field()
Partial error looks like:
if cookie.secure and request.type != "https":
AttributeError: 'WrappedRequest' object has no attribute 'type'
Here is the full error log:
"https://www.dropbox.com/s/kguw8174ye6p3q9/output.log?dl=0"
Looks like you are running Scrapy v1.1 while the current release is v1.4. As far as I remember, there was a bug in some early 1.x versions involving the WrappedRequest object used for handling cookies.
Try upgrading to v1.4:
pip install scrapy --upgrade
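Afterwards you can confirm which version is installed (the -v flag also prints the versions of Scrapy's main dependencies):

pip install scrapy --upgrade
scrapy version -v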