Clicking on 'read more...' using Scrapy - Python

I have been working on a Scrapy spider that can now scrape tripadvisor.com reviews. However, to extract the trip type I need the spider to click 'read more...'; otherwise the markup does not show up. On the first page (the first 5 reviews) the trip types are still hidden in the HTML, but on the following review pages the element disappears entirely and the spider only returns N/A. I have seen similar questions answered with Selenium, but after trying it I haven't been able to get it working. My code is as follows (without any Selenium). Thanks in advance!
# -*- coding: utf-8 -*-
import scrapy
import lxml.html


class TripreviewsSpider(scrapy.Spider):
    name = 'tripreviews'
    allowed_domains = ['tripadvisor.com']
    start_urls = ['https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html']

    ### Listing page crawling
    def parse_listing(self, response):
        urls = response.css('a.property_title.prominent::attr(href)').extract()
        for urlHotel in urls:
            print(urlHotel)
            urlHotel = response.urljoin(urlHotel)
            # If scrapy recognises a detail page it opens it and calls back to parse().
            yield scrapy.Request(url=urlHotel, callback=self.parse)
        # Otherwise move on to the next listing page and run parse_listing again.
        next_listing_url = response.css('a.nav.next.ui_button.primary.cx_brand_refresh_phase2::attr(href)').extract_first()
        next_listing_url = response.urljoin(next_listing_url)
        yield scrapy.Request(url=next_listing_url, callback=self.parse_listing)

    def parse(self, response):
        self.log('I just visited: ' + response.url)
        # Set up default value
        defaultValue = "N/A"

        # Narrow review dataset
        foundReviews = response.css('div[data-test-target="reviews-tab"] > div[data-test-target="HR_CC_CARD"]')
        # Narrow hotel details dataset
        foundHotelDetails = response.css("div[class*='ui_column is-12-tablet is-9-mobile hotels-hotel-review-atf-info-parts-ATFInfo__description--']")

        # Listing name
        listingName = []
        for hotel in foundHotelDetails:
            containslistingName = hotel.css("div[class*='ui_column is-12-tablet is-9-mobile hotels-hotel-review-atf-info-parts-ATFInfo__description--'] > div > h1#HEADING::text").extract()
            listingName.append(containslistingName or defaultValue)

        listingRating = []
        for ratingAverage in foundHotelDetails.css("a[class*='hotels-hotel-review-atf-info-parts-Rating__ratingsAnchor--'] > span[class*='bubble']").xpath("@class").extract():
            # For each found rating: strip everything off (extra classes and the prefix 'bubble_') to get the rating value.
            # Divide it by 10 to get the proper rating.
            listingRating.append(int(ratingAverage.split(" ")[1].split("bubble_")[1]) / 10)

        listingCategory = []
        for category in foundHotelDetails:
            containsCategory = category.css("div[class*='hotels-hotel-review-atf-info-parts-PopIndex__popIndex--'] > span > a::text").get()
            listingCategory.append(containsCategory.split(" ")[0] if containsCategory else defaultValue)

        # Build list of review attributes using the narrowed dataset
        reviewTitle = foundReviews.css('div[data-test-target="review-title"] > a > span > span::text').extract()

        # Clean up html tags in review text
        formattedReviewText = []
        for text in foundReviews:
            # For each entry: check if it has an element containing the review text
            foundText = str(text.css("q[class*='location-review-review-list-parts-ExpandableReview__reviewText']").get())
            formattedText = lxml.html.fromstring(foundText).text_content()
            # Add the text if it was found, or add the default value
            formattedReviewText.append(formattedText or defaultValue)

        reviewDateofStay = foundReviews.css("span[class*='location-review-review-list-parts-EventDate__event_date']::text").extract()

        # Because the origin country of the reviewer isn't always present, create a specific list.
        # This list makes sure that if the origin is present, it will be added to the corresponding review.
        # Otherwise it's defaultValue, so the list lengths always match.
        reviewLocation = []
        for location in foundReviews:
            # For each entry: check if it has an element containing the reviewer's origin
            containsReviewerOrigin = location.css("span[class*='default social-member-common-MemberHometown__hometown']::text").get()
            # Add the origin if it was found, or add the default value
            reviewLocation.append(containsReviewerOrigin or defaultValue)

        # Because ratings are rendered as a css class, you cannot extract them directly.
        # You need to strip the exact class out and then divide by 10 to get the rating on a scale from 1.0 to 5.0.
        # Ratings are also optional...
        reviewRating = []
        for rating in foundReviews.css("div[data-test-target='review-rating'] > span[class*='bubble']").xpath("@class").extract():
            # For each found rating: strip everything off (extra classes and the prefix 'bubble_') to get the rating value.
            # Divide it by 10 to get the proper rating.
            reviewRating.append(int(rating.split(" ")[1].split("bubble_")[1]) / 10)

        # Because trip type isn't always present, create a specific list.
        # This list makes sure that if trip type is present, it will be added to the corresponding review.
        # Otherwise it's defaultValue, so the list lengths always match.
        reviewTripType = []
        for review in foundReviews:
            # For each entry: check if it has an element containing the trip type
            containsTripType = review.css("span[class*='location-review-review-list-parts-TripType__trip_type--']::text").get()
            # Add the type if it was found, or add the default value
            reviewTripType.append(containsTripType or defaultValue)

        reviewHelpfulVotes = []
        for helpfulVotes in foundReviews:
            # For each entry: check if it has an element containing the helpful-vote count
            hasHelpfulVotes = helpfulVotes.css("div[class*='location-review-social-bar-SocialBar__bar--'] > span[class*='social-statistics-bar-SocialStatisticsBar__bar--'] > span[class*='social-statistics-bar-SocialStatisticsBar__counts']::text").get()
            # Add the count if it was found, or the default value (guards against NoneType).
            # Split on the first space and take the first element; the default value contains no spaces,
            # so valid counts are always split correctly.
            validHelpfulVote = hasHelpfulVotes or defaultValue
            reviewHelpfulVotes.append(validHelpfulVote.split(" ")[0])

        reviewPrefix = {
            'name': listingName,
            'listing Rating': listingRating,
            'category': listingCategory
        }

        # Create the review data
        for review in zip(reviewLocation, reviewTitle, formattedReviewText, reviewRating, reviewDateofStay, reviewTripType, reviewHelpfulVotes):
            reviewData = {
                'Origin of reviewer': review[0],
                'Title': review[1],
                'ReviewText': review[2],
                'Rating': review[3] or defaultValue,
                'Date of stay': review[4] or defaultValue,
                'Trip type': review[5],
                'Helpful votes': review[6]
            }
            reviewTotal = {**reviewPrefix, **reviewData}
            yield reviewTotal

        ### Next review page
        nextReviewPage = response.css('a.ui_button.nav.next.primary::attr(href)').extract_first()
        if nextReviewPage:
            nextReviewPage = response.urljoin(nextReviewPage)
            print(nextReviewPage)  # For checking purposes
            yield scrapy.Request(url=nextReviewPage, callback=self.parse)
        yield scrapy.Request(url='https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html', callback=self.parse_listing)
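For what it's worth, a rough Selenium sketch (not verified against TripAdvisor's current markup) of the idea mentioned above: render a review page, click every 'Read more...' expander, then run the same Scrapy-style selectors over the rendered HTML. The expander selector, the placeholder URL and the fixed sleep are all assumptions.

from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.selector import Selector
import time

driver = webdriver.Chrome()
driver.get('https://www.tripadvisor.com/SOME_HOTEL_REVIEW_PAGE')  # placeholder URL
time.sleep(3)  # crude wait for the review cards to render

# Click every expander so the full review (including trip type) enters the DOM
for expander in driver.find_elements(By.CSS_SELECTOR, "span[class*='ExpandableReview__cta']"):  # assumed selector
    try:
        expander.click()
    except Exception:
        pass  # some expanders may already be open or not clickable

# Reuse the existing selectors on the rendered page source
rendered = Selector(text=driver.page_source)
trip_types = rendered.css("span[class*='location-review-review-list-parts-TripType__trip_type--']::text").getall()
print(trip_types)
driver.quit()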

Related

Comparing results with Beautiful Soup in Python

I've got the following code that filters a particular search on an auction site.
I can display the title of each listing and also the len of all returned values:
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.findAll("div", attrs={"class":"tm-marketplace-search-card__title"})
print(len(listings))
for listing in listings:
    print(listing.text)
This prints out the following:
#print(len(listings))
3
#for listing in listings:
# print(listing.text)
PRS. Ten Top Custom 24, faded Denim, Piezo.
PRS SE CUSTOM 22
PRS Tremonti SE *With Seymour Duncan Pickups*
I know what I want to do next, but don't know how to code it. Basically I want to display only new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing it with the result of another GET request (a second variable) that maybe runs first thing in the morning. Alternatively, compare the text values instead of the len. If they don't match, show the new listings. Is there a better or different way to do this? Any help appreciated, thank you.
With length comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even when there are; and text comparison does not account for results with similar titles.
I can suggest 3 other methods. (The 3rd uses my preferred approach.)
Closing time
A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing
from datetime import date
import dateutil.parser

def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid': 'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days
and then filter by the returned value
min_dtc = 3  # or as preferred
# your current code up to listings = soup.findAll(...)
new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing.text)
However, I don't know whether sellers are allowed to set their own closing times or whether they're set at a fixed offset; also, I don't see the closing-time text when inspecting with the browser dev tools (even though I could extract it with the code above), which makes me a bit unsure whether it's always available.
JSON list of Listing IDs
Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save those IDs in a list as a JSON file and keep checking against it on every new scrape:
from bs4 import BeautifulSoup
import requests
import json

lFilename = 'listing_ids.json'  # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")

try:
    prev_listings = json.load(open(lFilename, 'r'))
except Exception as e:
    prev_listings = []
print(len(prev_listings), 'saved listings found')

soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")
new_listings = [
    l for l in listings if
    l.get('href').split('/listing/')[1].split('?')[0]
    not in prev_listings
]

print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)

with open(lFilename, 'w') as f:
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ], f)
This should be fairly reliable as long as they don't tend to recycle the listing IDs. (Even then, every once in a while, after checking the new listings for that day, you can just delete the JSON file and re-run the program once; that will also keep the file from getting too big...)
CSV Logging [including Listing IDs]
Instead of just saving the IDs, you can save pretty much all the details from each result
from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas

lFilename = 'listings.csv'  # or as preferred
max_days = 60  # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")

try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')

def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}
    for k, sel in selectorsRef.items():
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text
    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()
    return lDets

soup = BeautifulSoup(url.text, "html.parser")
listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]

new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing['title'])

prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)
You'll end up with a spreadsheet of scraping history/log that you can check anytime; depending on what you set max_days to, the oldest data will be cleared automatically.
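For example, a quick way to pull just the new listings back out of the saved log (the filename and column names are the ones used in the snippet above):

import pandas

log = pandas.read_csv('listings.csv')  # default filename from the snippet above
# 'new_listing', 'listing_id', 'title' and 'listing_link' are written by get_listing_details
print(log[log['new_listing']][['listing_id', 'title', 'listing_link']])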
Fixed it with the following:
allGuitars = ["",]
latestGuitar = soup.select("#-title")[0].text.strip()
if latestGuitar in allGuitars[0]:
print("No change. The latest listing is still: " + allGuitars[0])
elif not latestGuitar in allGuitars[0]:
print("New listing detected! - " + latestGuitar)
allGuitars.clear()
allGuitars.insert(0, latestGuitar)
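Note that allGuitars only lives for the duration of one run; to make the comparison survive between runs, the latest title could be persisted, for example to a small JSON file (a sketch; the filename is an assumption):

import json

STATE_FILE = 'latest_guitar.json'  # assumed filename

def load_latest():
    # Return the previously saved title, or an empty string on the first run
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return ""

def save_latest(title):
    with open(STATE_FILE, 'w') as f:
        json.dump(title, f)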

BeautifulSoup - how to get items from a website that don't contain a div, unlike the rest of the items

I am trying to scrape job ads from this website: https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging
I want to get all the visible data - job title, location, and the short descriptions such as: Full Stack; DBA, Big Data; Data Science, AI, ML and Embedded; Test, QA. The scraping part for this is:
result = requests.get("https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging").text
soup = bs4.BeautifulSoup(result, "lxml")
jobs = soup.find_all('td', class_ = "offerslistRow")
for job in jobs:
    description = job.find_all('div', class_="card__subtitle mdc-typography mdc-typography--body2")
and it is the [0] part to be precise, as there are two kinds of short descriptions with the same class name, but this is not the issue.
Some ads don't have a short description, and they also don't have the mentioned div at all (it is not just empty, it doesn't exist).
Is there a way to also get a description for such ads, e.g. "N/A" or something like that?
I'm assuming you want to scrape all the job details, as the question was a bit unclear. I have made a few other changes to your code as well and handled all possible cases.
The following code should do the job:
import bs4
import requests

result = requests.get("https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging").text
soup = bs4.BeautifulSoup(result, "lxml")

# find all jobs
jobs = soup.find_all('td', class_ = "offerslistRow")

# list to store job title
job_title = []
# list to store job location
job_location = []
# list to store domain and skills
domain_and_skills = []

# loop through the jobs
for job in jobs:
    # this check is to remove the other two blocks aligned to the right
    if job.find('a', class_="card__title mdc-typography mdc-typography--headline6 text-overflow") is not None:
        # find and append job name
        job_name = job.find('a', class_="card__title mdc-typography mdc-typography--headline6 text-overflow")
        job_title.append(job_name.text)
        # find and append location and salary description
        location_salary_desc = job.find('span', class_='card__subtitle mdc-typography mdc-typography--body2 top-margin')
        if location_salary_desc is not None:
            job_location.append(location_salary_desc.text.strip())
        else:
            job_location.append('NA')
        # find other two descriptions (skills and domains)
        description = job.find_all(class_="card__subtitle mdc-typography mdc-typography--body2")
        # if both are empty (len=0)
        if len(description) == 0:
            domain_and_skills.append('NA')
        # if len=1 (can either be skills or domain details)
        elif len(description) == 1:
            # to check if domain is present and skills is empty
            if description[0].find('div') is None:
                domain_and_skills.append(description[0].text.strip())
            # domain is empty and skills is present
            else:
                # list to store skills
                skills = []
                # find all images in skills section and get alt attribute which contains skill name
                images = description[0].find_all('img')
                # if no image and only text is present (for example Shell Scripts is not an image, contains text value)
                if len(images) == 0:
                    skills.append(description[0].text.strip())
                # both image and text are present
                else:
                    # for each image, append skill name to the list
                    for image in images:
                        skills.append(image['alt'])
                    # append text to the list if not empty
                    if description[0].text.strip() != '':
                        skills.append(description[0].text.strip())
                # convert list to string
                skills_string = ','.join([str(skill) for skill in skills])
                domain_and_skills.append(skills_string)
        # both domain and skills are present
        else:
            domain_string = description[0].text.strip()
            # similar procedure as above to collect skill names
            skills = []
            images = description[1].find_all('img')
            if len(images) == 0:
                skills.append(description[1].text.strip())
            else:
                for image in images:
                    skills.append(image['alt'])
                if description[1].text.strip() != '':
                    skills.append(description[1].text.strip())
            skills_string = ','.join([str(skill) for skill in skills])
            # combine domain and skills
            domain_string = domain_string + ',' + skills_string
            domain_and_skills.append(domain_string)

for i in range(0, len(job_title)):
    print(job_title[i])
    print(job_location[i])
    print(domain_and_skills[i])

How to scrape RottenTomatoes Audience Reviews using Python?

I'm creating a spider using Scrapy to scrape details from rottentomatoes.com. As the search page is rendered dynamically, I used the Rotten Tomatoes API (e.g. https://www.rottentomatoes.com/api/private/v2.0/search?q=inception) to get the search results and URLs. Following the URL via Scrapy, I was able to extract the tomatometer score, audience score, director, cast etc. However, I also want to extract all the audience reviews. The issue is that the audience reviews page (https://www.rottentomatoes.com/m/inception/reviews?type=user) works using pagination, and I'm not able to extract data from the next page; moreover, I couldn't find a way to use the API to extract the details either. Could anyone help me with this?
def parseRottenDetail(self, response):
    print("Reached Tomato Parser")
    try:
        if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
            items = TomatoCrawlerItem()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatometerScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half .mop-ratings-wrap__percentage::text').get().strip()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][
                'tomatoAudienceScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half.audience-score .mop-ratings-wrap__percentage::text').get().strip()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][
                'tomatoCriticConsensus'] = response.css('p.mop-ratings-wrap__text--concensus::text').get()
            if MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["type"] == "Movie":
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                    "//ul[@class='content-meta info']/li[@class='meta-row clearfix']/div[contains(text(),'Directed By')]/../div[@class='meta-value']/a/text()").get()
            else:
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                    "//div[@class='tv-series__series-info-castCrew']/div/span[contains(text(),'Creator')]/../a/text()").get()
            reviews_page = response.css('div.mop-audience-reviews__view-all a[href*="reviews"]::attr(href)').get()
            if len(reviews_page) != 0:
                yield response.follow(reviews_page, callback=self.parseRottenReviews)
            else:
                for key in MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse].keys():
                    if "pageURL" not in key and "type" not in key:
                        items[key] = MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][key]
                yield items
                if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
                    MoviecrawlSpider.current_parse += 1
                    print("Parse Values are Current Parse " + str(
                        MoviecrawlSpider.current_parse) + "and Total Results " + str(MoviecrawlSpider.total_results))
                    yield response.follow(MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["pageURL"],
                                          callback=self.parseRottenDetail)
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        print(e)
        print(exc_tb.tb_lineno)
After this piece of code is executed I reach the reviews page, e.g. https://www.rottentomatoes.com/m/inception/reviews?type=user. There is a next button, and the next page is loaded via pagination. So what should my approach be to extract all the reviews?
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    items = TomatoCrawlerItem()
When you go to the next page, you can notice that it uses the previous page's end cursor value. You can set endCursor to an empty string for the first iteration. Also note that you need the movieId for requesting reviews; this id can be extracted from the JSON embedded in the page's JS:
import requests
import re
import json

r = requests.get("https://www.rottentomatoes.com/m/inception/reviews?type=user")
data = json.loads(re.search(r'movieReview\s=\s(.*);', r.text).group(1))
movieId = data["movieId"]

def getReviews(endCursor):
    r = requests.get(f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
                     params={
                         "direction": "next",
                         "endCursor": endCursor,
                         "startCursor": ""
                     })
    return r.json()

reviews = []
result = {}
for i in range(0, 5):
    print(f"[{i}] request review")
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    reviews.extend([t for t in result["reviews"]])

print(reviews)
print(f"got {len(reviews)} reviews")
Note that you can also scrape the html for the first iteration
As I'm using Scrapy, I was looking for a way to do this without the requests module. The approach is the same, but I found that the page https://www.rottentomatoes.com/m/inception has an object root.RottenTomatoes.context.fandangoData in a <script> tag, with a key "emsId" holding the id of the movie, which is passed to the API to get details. Going through the network tab on each pagination event, I realised that the requests use startCursor and endCursor to filter the results for each page.
pattern = r'\broot.RottenTomatoes.context.fandangoData\s*=\s*(\{.*?\})\s*;\s*\n'
json_data = response.css('script::text').re_first(pattern)
movie_id = json.loads(json_data)["emsId"]
{SpiderClass}.movieId = movie_id
next_page = "https://www.rottentomatoes.com/napi/movie/" + movie_id + "/reviews/user?direction=next&endCursor=&startCursor="
yield response.follow(next_page, callback=self.parseRottenReviews)
For the first iteration, you can leave the startCursor and endCursor parameters blank. Now you enter the parse function. You can get the startCursor and endCursor of the next page from the current response, and repeat this until the hasNextPage attribute is false.
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    current_result = json.loads(response.text)
    for review in current_result["reviews"]:
        {SpiderClass}.reviews.append(review)  # spider class member, so it is shared among iterations
    if current_result["pageInfo"]["hasNextPage"] is True:
        next_page = ("https://www.rottentomatoes.com/napi/movie/" + str({SpiderClass}.movieId)
                     + "/reviews/user?direction=next&endCursor=" + str(current_result["pageInfo"]["endCursor"])
                     + "&startCursor=" + str(current_result["pageInfo"]["startCursor"]))
        yield response.follow(next_page, callback=self.parseRottenReviews)
Now the {SpiderClass}.reviews array will hold all the reviews.
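One way to actually emit the collected reviews once pagination stops (a sketch, not part of the original answer; it keeps the {SpiderClass} placeholder used above and assumes one yielded dict per review is the desired output):

def parseRottenReviews(self, response):
    # same as the snippet above, with an added else branch that yields the
    # collected reviews once there are no more pages
    current_result = json.loads(response.text)
    for review in current_result["reviews"]:
        {SpiderClass}.reviews.append(review)
    if current_result["pageInfo"]["hasNextPage"]:
        next_page = ("https://www.rottentomatoes.com/napi/movie/" + str({SpiderClass}.movieId)
                     + "/reviews/user?direction=next&endCursor=" + str(current_result["pageInfo"]["endCursor"])
                     + "&startCursor=" + str(current_result["pageInfo"]["startCursor"]))
        yield response.follow(next_page, callback=self.parseRottenReviews)
    else:
        # last page reached: emit everything collected on the spider class
        for review in {SpiderClass}.reviews:
            yield review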

Web scraping - I cannot use a for loop to index a list of elements

I am currently building a web scraper and I encountered a problem.
When I try to build my for loop in order to regroup all the information by company, the extraction keeps showing all elements of the same type together.
When I realized that it didn't work, I went back and tried to show only the first element of the list, but even when I type [0] all the elements are shown to me, as if no specific selection was made.
import scrapy
from centech.items import CentechItem

class CentechSpiderSpider(scrapy.Spider):
    name = 'centech_spider'
    start_urls = ['https://centech.co/nos-entreprises/']

    def parse(self, response):
        items = CentechItem()
        all_companies = response.xpath("//div[@class = 'fl-post-carousel-post']")[1]
        Nom = all_companies.xpath("//h2[contains(@class, 'fl-post-carousel-title')]/text()").extract()
        Description = all_companies.xpath("//div[contains(@class, 'description')]/p/text()").extract()
        # Nom = all_companies.css("h2.fl-post-carousel-title::text").extract()
        # Description = all_companies.xpath("p::text").extract()
        yield {'Nom': Nom,
               'Description': Description,
               }
I expect to see only the first element of the page, but all the companies are shown.
Thank you.
I'm not quite sure about the output you wish to have. I took a guess and modified your script to grab the following results. You need to go one layer deeper to fetch the full description, as some of the descriptions are broken up:
import scrapy

class CentechSpiderSpider(scrapy.Spider):
    name = 'centech_spider'
    start_urls = ['https://centech.co/nos-entreprises/']

    def parse(self, response):
        for item in response.css("a.fl-post-carousel-link"):
            nom = item.css(".description > h2.fl-post-carousel-title::text").get()
            description = item.css(".description > p::text").get()
            yield {'nom': nom, 'description': description}
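Since some descriptions are split across several text nodes, a small variation of the parse() above joins every fragment instead of keeping only the first one returned by .get() (the selectors are reused from the answer; whether every card follows this structure is an assumption):

    def parse(self, response):
        for item in response.css("a.fl-post-carousel-link"):
            nom = item.css(".description > h2.fl-post-carousel-title::text").get()
            # join all text fragments of the description rather than only the first
            description = " ".join(
                t.strip() for t in item.css(".description > p::text").getall()
            )
            yield {'nom': nom, 'description': description}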

Scraping multi-level data using Scrapy, optimal way

I have been wondering what the best way is to scrape multiple levels of data using Scrapy.
I will describe the situation in four stages:
the current architecture that I am following to scrape this data
the basic code structure
the difficulties, and why I think there has to be a better option
the format in which I have tried to store the data and failed, then partially succeeded
Current architecture
The data structure:
First page: list of artists
Second page: list of albums for each artist
Third page: list of songs for each album
Basic code structure
class MusicLibrary(Spider):
    name = 'MusicLibrary'

    def parse(self, response):
        items = Discography()
        items['artists'] = []
        for artist in artists:
            item = Artist()
            item['albums'] = []
            item['artist_name'] = "name"
            items['artists'].append(item)
            album_page_url = "extract link to album and yield that page"
            yield Request(album_page_url,
                          callback=self.parse_album,
                          meta={'item': items,
                                'artist_name': item['artist_name']})

    def parse_album(self, response):
        base_item = response.meta['item']
        artist_name = response.meta['artist_name']
        # this will search for the artist added in the previous method and append the album under that artist
        artist_index = self.get_artist_index(base_item['artists'], artist_name)
        albums = "some path selector"
        for album in albums:
            item = Album()
            item['songs'] = []
            item['album_name'] = "name"
            base_item['artists'][artist_index]['albums'].append(item)
            song_page_url = "extract link to song and yield that page"
            yield Request(song_page_url,
                          callback=self.parse_song_name,
                          meta={'item': base_item,
                                "key": item['album_name'],
                                'artist_index': artist_index})

    def parse_song_name(self, response):
        base_item = response.meta['item']
        album_name = response.meta['key']
        artist_index = response.meta["artist_index"]
        album_index = self.search(base_item['artists'][artist_index]['albums'], album_name)
        songs = "some path selector"
        for song in songs:
            item = Song()
            song_name = "song name"
            base_item['artists'][artist_index]['albums'][album_index]['songs'].append(item)
        # total_count (total songs to parse) = the main artist page lists the total number of songs for each artist
        # current_count (currently parsed) = I go into each artist->album->songs->[] and count the length
        # I yield base_item only when the songs-to-scrape and songs-scraped counts match
        if current_count == total_count:
            yield base_item
The difficulties, and why I think there has to be a better option
Currently I am yielding the item object only when all the pages and sub-pages have been scraped, on the condition that the songs-to-scrape and songs-scraped counts match.
But given the nature and volume of the scraping, some pages return a code other than 200 (status OK); those songs will not be scraped and the item count will not match.
So in the end, even though 90% of the pages are scraped successfully, the counts won't match, nothing will be yielded and all the CPU time is wasted.
The format in which I have tried to store the data and failed, then partially succeeded
I wanted the data for each item object in a single-line format,
i.e. artistName - albumName - songName,
so if artist A has 1 album (aa) with 8 songs, 8 items will be stored, one entry (item) per song.
But with the current format, when I tried yielding every time in the last function, parse_song_name, it yielded that complex structure every time, and the object grew incrementally each time.
Then I thought that appending everything to Discography->artists, then Artist->albums and then Album->songs was the problem, but when I removed the appending and tried without it, I only yielded one object, the last one, not all of them.
So finally I developed the workaround described above, but it does not work every time (in case of a non-200 status code).
When it does work, after yielding, I have written a pipeline where I parse this JSON again and store it in the data format I initially wanted (one line for each song, a flat structure).
Can anyone suggest what I am doing wrong here, or how I can make this more efficient and make it work when some of the pages return a non-200 code?
The problem with the code above was:
Mutable objects (list, dict): all the callbacks were changing that same object in each loop, so the first and second levels of data were being overwritten in the third loop (mp3_son_url)... (this was my failed attempt).
The solution was to use a simple copy.deepcopy to create a new object from the response.meta object in the callback method, and not to change the base_item object.
I will try to explain the full answer when I get some time.
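A minimal sketch of that fix, under the assumption that the callbacks keep the same structure as the code above; only the start of parse_album is shown (parse_song_name would be handled the same way):

import copy

def parse_album(self, response):
    # deep-copy the shared item so each callback works on its own object
    # instead of mutating the one passed around in meta
    base_item = copy.deepcopy(response.meta['item'])
    artist_name = response.meta['artist_name']
    artist_index = self.get_artist_index(base_item['artists'], artist_name)
    # ...then build the albums exactly as before and pass this copy (base_item)
    # along in meta for the next request, e.g.
    # yield Request(song_page_url, callback=self.parse_song_name,
    #               meta={'item': base_item, 'artist_index': artist_index})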
