I have been wondering what the best way would be to scrape multi-level data using Scrapy.
I will describe the situation in four stages:
the current architecture I am following to scrape this data
the basic code structure
the difficulties, and why I think there has to be a better option
the format in which I have tried to store the data, failed, and then partially succeeded
Current Architecture
The data structure:
First page: list of artists
Second page: list of albums for each artist
Third page: list of songs for each album
Basic code structure
class MusicLibrary(Spider):
name = 'MusicLibrary'
def parse(self, response):
items = Discography()
items['artists'] = []
artists = "some path selector"  # placeholder, like the other selectors in this sketch
for artist in artists:
item = Artist()
item['albums'] = []
item['artist_name'] = "name"
items['artists'].append(item)
album_page_url = "extract link to album and yield that page"
yield Request(album_page_url,
callback=self.parse_album,
meta={'item': items,
'artist_name': item['artist_name']})
def parse_album(self, response):
base_item = response.meta['item']
artist_name = response.meta['artist_name']
# this will search for the artist added in previous method and append album under that artist
artist_index = self.get_artist_index(base_item['artists'], artist_name)
albums = "some path selector"
for album in albums:
item = Album()
item['songs'] = []
item['album_name'] = "name"
base_item['artists'][artist_index]['albums'].append(item)
song_page_url = "extract link to song and yield that page"
yield Request(song_page_url,
callback=self.parse_song_name,
meta={'item': base_item,
"key": item['album_name'],
'artist_index': artist_index})
def parse_song_name(self, response):
base_item = response.meta['item']
album_name = response.meta['key']
artist_index = response.meta["artist_index"]
album_index = self.search(base_item['artists'][artist_index]['albums'], album_name)
songs = "some path selector "
for song in songs:
item = Song()
song_name = "song name"
base_item['artists'][artist_index]['albums'][album_index]['songs'].append(item)
# total_count (total songs to parse): the main artist page lists the total number of songs for each artist
# current_count (songs parsed so far): walk each artist -> album -> songs list and sum the lengths
# yield base_item only when the number of songs scraped matches the number of songs to scrape
if current_count == total_count:
yield base_item
The difficulties, and why I think there has to be a better option
Currently I am yielding the item object only when all the pages and sub-pages have been scraped, on the condition that the number of songs to scrape matches the number of songs actually scraped.
But given the nature and volume of the scraping, some pages return a status code other than 200 (OK); those songs will not be scraped and the counts will not match.
So at the end, even though 90% of the pages may have been scraped successfully, the counts will not match, nothing will be yielded, and all that CPU power will have been wasted.
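One hedged way to keep that count-based approach from stalling (a sketch only, not tested against the real site; expected_songs_for() is a hypothetical helper and it assumes total_count is tracked on the spider) is to attach an errback to each request and shrink the expected total whenever a page fails, so the final yield can still fire:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class MusicLibrary(scrapy.Spider):
    name = 'MusicLibrary'

    def parse_album(self, response):
        base_item = response.meta['item']
        song_page_url = "extract link to song and yield that page"
        yield scrapy.Request(song_page_url,
                             callback=self.parse_song_name,
                             errback=self.handle_failure,  # fires on non-200 responses and network errors
                             meta={'item': base_item})

    def handle_failure(self, failure):
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.warning("Got status %s for %s", response.status, response.url)
        # expected_songs_for() is a hypothetical helper returning how many songs the
        # failed request would have contributed; reducing the target lets the count
        # check in parse_song_name still succeed for the pages that did work.
        self.total_count -= self.expected_songs_for(failure.request)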
The format in which I have tried to store the data, failed, and then partially succeeded
I wanted the data for each item object in a single-line format,
i.e. artistName - albumName - songName.
So if artist A has 1 album (aa) with 8 songs, 8 items will be stored, one entry (item) per song.
But with the current structure, when I tried yielding every time in the last method, parse_song_name, it yielded that complex nested structure every time, and the object grew incrementally with each yield.
Then I thought that appending everything into Discography->artists, then Artist->albums and then Album->songs was the problem, but when I removed the appending and tried without it, only one object was yielded, the last one, not all of them.
So finally I developed the workaround described above, but it does not work every time (in case of a non-200 status code).
And when it does work, after yielding I have written a pipeline where I parse this JSON again and store it in the format I initially wanted (one line for each song, a flat structure).
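For illustration, a minimal sketch of such a flattening pipeline (assuming the nested item behaves like plain dicts and writing one CSV row per song; the file name and column names are only placeholders):

import csv

class FlattenDiscographyPipeline:
    def open_spider(self, spider):
        self.file = open('songs_flat.csv', 'w', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['artist', 'album', 'song'])

    def process_item(self, item, spider):
        # Walk the nested structure and emit one flat row per song
        for artist in item['artists']:
            for album in artist['albums']:
                for song in album['songs']:
                    self.writer.writerow([artist['artist_name'],
                                          album['album_name'],
                                          song.get('song_name', '')])
        return item

    def close_spider(self, spider):
        self.file.close()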
Can anyone suggest what I am doing wrong here, or how I can make this more efficient and make it work when some of the pages return a non-200 code?
The problem with the code above was:
Mutable objects (list, dict): all the callbacks were changing that same shared object in each loop, so the first and second levels of data were being overwritten by the last, third-level loop (mp3_son_url). This was my failed attempt.
The solution was to use copy.deepcopy to create a new object from the response.meta object in the callback method, instead of changing the shared base_item object.
I will try to explain the full answer when I get some time.
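A minimal sketch of that fix, based on the pseudocode above (each callback works on its own deep copy of the carried item instead of mutating the shared one):

import copy

def parse_album(self, response):
    # Copy the carried item so parallel callbacks don't overwrite each other
    base_item = copy.deepcopy(response.meta['item'])
    artist_name = response.meta['artist_name']
    artist_index = self.get_artist_index(base_item['artists'], artist_name)
    # ... append albums to base_item exactly as before, then pass the copy onwards ...
    song_page_url = "extract link to song and yield that page"
    yield Request(song_page_url,
                  callback=self.parse_song_name,
                  meta={'item': base_item, 'artist_index': artist_index})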
Related
How can I go about parsing data for one variable directly from the start URL, and data for the other variables after following all the hrefs from the start URL?
The web page I want to scrape has a list of articles with "category", "title", "content", "author" and "date" data. In order to scrape the data, I followed all the "href" links on the start URL, which redirect to the full articles, and parsed the data there. However, the "category" data is not always available when an individual article is opened from its "href" on the start URL, so I end up with missing data for some observations. Now I'm trying to scrape just the "category" data directly from the start URL, which has the "category" data for all article listings (no missing data). How should I go about parsing the "category" data, and how should I handle the parsing and callback? The "category" data is circled in red in the image.
class BtcNewsletterSpider(scrapy.Spider):
name = 'btc_spider'
allowed_domains = ['www.coindesk.com']
start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']
def parse(self, response):
for link in response.css('.card-title'):
yield response.follow(link, callback=self.parse_newsletter)
def parse_newsletter(self, response):
item = CoindeskItem()
item['category'] = response.css('.kjyoaM::text').get()
item['headline'] = response.css('.fPbJUO::text').get()
item['content_summary'] = response.css('.jPQVef::text').get()
item['authors'] = response.css('.dWocII a::text').getall()
item['published_date'] = response.css(
'.label-with-icon .fUOSEs::text').get()
yield item
You can use the cb_kwargs argument to pass data from one parse callback to another. To do this you need to grab the category value for the link to the corresponding full article, which you can do by iterating over any element that contains both the category and the link and pulling the information for both out of that element.
Here is an example based on the code you provided; it should work the way you described.
class BtcNewsletterSpider(scrapy.Spider):
name = 'btc_spider'
allowed_domains = ['www.coindesk.com']
start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']
def parse(self, response):
for card in response.xpath("//div[contains(@class,'articleTextSection')]"):
item = CoindeskItem()
item["category"] = card.xpath(".//a[@class='category']//text()").get()
link = card.xpath(".//a[@class='card-title']/@href").get()
yield response.follow(
link,
callback=self.parse_newsletter,
cb_kwargs={"item": item}
)
def parse_newsletter(self, response, item):
item['headline'] = response.css('.fPbJUO::text').get()
item['content_summary'] = response.css('.jPQVef::text').get()
item['authors'] = response.css('.dWocII a::text').getall()
item['published_date'] = response.css(
'.label-with-icon .fUOSEs::text').get()
yield item
I am trying to scrape job ads from this website: https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging
I want to get all the visible data: job title, location, and the short descriptions such as "Full Stack; DBA", "Big Data; Data Science, AI, ML and Embedded", "Test, QA". The scraping part for this is:
result = requests.get("https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging").text
soup = bs4.BeautifulSoup(result, "lxml")
jobs = soup.find_all('td', class_ = "offerslistRow")
for job in jobs:
description = job.find_all('div', class_="card__subtitle mdc-typography mdc-typography--body2")
To be precise it is the [0] element, as there are two types of short description with the same class name, but that is not the issue.
Some ads don't have a short description, and they also don't have the mentioned div at all (it is not just empty, it doesn't exist).
Is there a way to also get a description for such ads, for example "N/A" or something like that?
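For the missing-description case on its own, a small sketch of the usual fallback pattern (class name taken from the question) might look like this:

for job in jobs:
    desc_tag = job.find('div', class_="card__subtitle mdc-typography mdc-typography--body2")
    description = desc_tag.get_text(strip=True) if desc_tag is not None else "N/A"
    print(description)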
I'm assuming you want to scrape all the job details, as the question was a bit unclear. I have made a few other changes to your code as well and handled all the possible cases.
The following code should do the job:
import bs4
import requests
result = requests.get("https://www.jobs.bg/front_job_search.php?frompage=0&add_sh=1&categories%5B0%5D=29&location_sid=1&keywords%5B0%5D=python&term=#paging").text
soup = bs4.BeautifulSoup(result, "lxml")
# find all jobs
jobs = soup.find_all('td', class_ = "offerslistRow")
# list to store job title
job_title=[]
# list to store job location
job_location=[]
# list to store domain and skills
domain_and_skills=[]
# loop through the jobs
for job in jobs:
# this check is to remove the other two blocks aligned to the right
if job.find('a',class_="card__title mdc-typography mdc-typography--headline6 text-overflow") is not None:
# find and append job name
job_name=job.find('a',class_="card__title mdc-typography mdc-typography--headline6 text-overflow")
job_title.append(job_name.text)
# find and append location and salary description
location_salary_desc=job.find('span',class_='card__subtitle mdc-typography mdc-typography--body2 top-margin')
if location_salary_desc is not None:
job_location.append(location_salary_desc.text.strip())
else:
job_location.append('NA')
# find other two descriptions (Skills and domains)
description = job.find_all(class_="card__subtitle mdc-typography mdc-typography--body2")
# if both are empty (len=0)
if len(description)==0:
domain_and_skills.append('NA')
# if len=1 (can either be skills or domain details)
elif len(description)==1:
# to check if domain is present and skills is empty
if description[0].find('div') is None:
domain_and_skills.append(description[0].text.strip())
# domain is empty and skills is present
else:
# list to store skills
skills=[]
# find all images in skills section and get alt attribute which contains skill name
images=description[0].find_all('img')
# if no image and only text is present (for example Shell Scripts is not an image, contains text value)
if len(images)==0:
skills.append(description[0].text.strip())
# both image and text is present
else:
# for each image, append skill name in list
for image in images:
skills.append(image['alt'])
# append text to list if not empty
if description[0].text.strip() !='':
skills.append(description[0].text.strip())
#convert list to string
skills_string = ','.join([str(skill) for skill in skills])
domain_and_skills.append(skills_string)
# both domain and skills are present
else:
domain_string=description[0].text.strip()
# similar procedure as above to print skill names
skills=[]
images=description[1].find_all('img')
if len(images)==0:
skills.append(description[1].text.strip())
else:
for image in images:
skills.append(image['alt'])
if description[1].text.strip() !='':
skills.append(description[1].text.strip())
skills_string = ','.join([str(skill) for skill in skills])
#combine domain and skills
domain_string=domain_string+','+skills_string
domain_and_skills.append(domain_string)
for i in range(0,len(job_title)):
print(job_title[i])
print(job_location[i])
print(domain_and_skills[i])
New to Scrapy here and trying to figure out how to yield the item only once, when it has finished populating.
I am trying to scrape a site that publishes swimmer times, built in such a way that the pages are structured like this:
Swimmer search page -> Swimmer page with list of swim styles -> Style page with all the times for that style
I am using a nested set of items,
Swimmer -> [Styles] -> [Times]
to output a single JSON dict per swimmer, containing all the styles he or she swam and all the times recorded within each style.
The issue is that this code yields the same item over and over again rather than just once (as I would want and expect), creating a lot of waste.
import scrapy
from tempusopen.settings import swimmers
from tempusopen.items import Swimmer, Time, Style
from scrapy.loader import ItemLoader
class BaseUrl(scrapy.Item):
url = scrapy.Field()
class RecordsSpider(scrapy.Spider):
name = 'records_spider'
allowed_domains = ['www.tempusopen.fi']
def start_requests(self):
base_url = ('https://www.tempusopen.fi/index.php?r=swimmer/index&Swimmer[first_name]={firstname}&'
'Swimmer[last_name]={lastname}&Swimmer[searchChoice]=1&Swimmer[swimmer_club]={team}&'
'Swimmer[class]=1&Swimmer[is_active]=1')
urls = [base_url.format_map(x) for x in swimmers]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
swimmer_url = response.xpath('//table//tbody/tr/td/a[@class="view"]/@href').get()
swimmer = Swimmer()
return response.follow(swimmer_url, callback=self.parse_records, meta={'swimmer': swimmer})
def parse_records(self, response):
distances = response.xpath('//table//tbody/tr/td/a[@class="view"]/@href').extract()
swimmer_data = response.xpath("//div[@class='container main']//"
"div[@class='sixteen columns']//text()").extract()
swimmer = response.meta['swimmer']
swimmer['id'] = response.url.split('=')[-1]
swimmer['name'] = swimmer_data[1]
swimmer['team'] = swimmer_data[5].strip('\n').split(',')[0].split(':')[1].strip()
swimmer['status'] = swimmer_data[5].split(',')[1:]
swimmer_data = response.xpath("//div[@class='container main']//"
"div[@class='clearfix']//div[@class='six columns']"
"//text()").extract()
swimmer['born'] = swimmer_data[2].strip('\n')
swimmer['license'] = swimmer_data[4].strip('\n')
for url in distances:
yield response.follow(url, callback=self.parse_distances, meta={'swimmer': swimmer})
def parse_distances(self, response):
swimmer = response.meta['swimmer']
style = Style()
try:
swimmer['styles']
except:
swimmer['styles'] = []
distance = response.xpath('//div[@class="container main"]//p/text()').extract_first()
distance = distance.strip().split('\n')[1]
style['name'] = distance
try:
style['times']
except:
style['times'] = []
swimmer['styles'].append(style)
table_rows = response.xpath("//table//tbody/tr")
for tr in table_rows:
t = Time()
t['time'] = tr.xpath("./td[1]/text()").extract_first().strip("\n\xa0")
t['date'] = tr.xpath("./td[4]/text()").extract_first()
t['competition'] = tr.xpath("./td[5]/text()").extract_first()
style['times'].append(t)
return swimmer
I suppose the issue is using yield and return in the "right" way, but I can't figure out the right solution.
I tried with only yield, and I could see the JSON dict of each swimmer slowly populating.
I tried with only one final return swimmer at the end and yield everywhere else, but that just gave me the same JSON dict repeated endlessly for each swimmer...
The wanted behaviour would be for the code to output one single JSON dict per swimmer I search for in the start_urls list (not the gazillions I am getting now).
Any help appreciated, thanks!
PS: you can pull the code here.
As an example of the swimmers dict you can use this one:
swimmers = [
# Add here the list of swimmers you are interested in scraping
{'firstname': 'Lenni', 'lastname': 'Parpola', 'team': ''},
{'firstname': 'Tommi', 'lastname': 'Kangas', 'team': ''},
]
You have two alternatives:
Use Scrapy inline requests to implement parse_distances.
Don't change anything in your code, but create a custom pipelines.py and work with each swimmer (adding the new dictionary details) in process_item. You'll then be able to write out all the results at the end of your spider run.
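A rough sketch of the second alternative (names are illustrative, and it assumes the spider yields dict-like items carrying the swimmer id): merge the partial items per swimmer in a pipeline and write everything out when the spider closes.

import json

class MergeSwimmersPipeline:
    def open_spider(self, spider):
        self.swimmers = {}

    def process_item(self, item, spider):
        # Merge the styles from each partial item into one record per swimmer id
        record = self.swimmers.setdefault(item['id'], {'styles': []})
        record.update({k: v for k, v in item.items() if k != 'styles'})
        record['styles'].extend(item.get('styles', []))
        return item

    def close_spider(self, spider):
        # One JSON dict per swimmer, emitted exactly once at the end of the crawl
        # (assumes the nested values are JSON-serialisable, e.g. plain dicts)
        with open('swimmers.json', 'w') as f:
            json.dump(list(self.swimmers.values()), f, indent=2)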
Finally figured it out, mostly thanks to this related question.
Because of the way Scrapy works, rather than using a normal for loop in the parse_records parser we need to go through the URLs one by one and exhaust the URLs in distances before returning a swimmer item back (if we don't do that, we will return the same swimmer item multiple times rather than once per swimmer).
This is achieved by popping the first URL, passing the remaining distance URLs through to the next level, and checking that we are really done with this swimmer before firing a return (in the next level of parsing, which is parse_distances).
Once in there we manually go to the next style by popping a URL out of the distances list and recursively calling parse_distances until we exhaust the list; finally, if we really are done with the styles for this swimmer, we return a swimmer item, exactly once per swimmer.
For reference you can see the whole thing in action on GitHub.
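A rough sketch of that pattern, with the field extraction elided (selectors follow the question's code, and it assumes each swimmer has at least one style page):

def parse_records(self, response):
    swimmer = response.meta['swimmer']
    # ... fill in id, name, team, born, license exactly as in the original code ...
    distances = response.xpath('//table//tbody/tr/td/a[@class="view"]/@href').extract()
    swimmer['styles'] = []
    # Instead of looping, request the first style page and carry the rest along
    first = distances.pop(0)
    yield response.follow(first, callback=self.parse_distances,
                          meta={'swimmer': swimmer, 'remaining': distances})

def parse_distances(self, response):
    swimmer = response.meta['swimmer']
    remaining = response.meta['remaining']
    # ... build the Style with its Times and append it to swimmer['styles'] ...
    if remaining:
        # Recurse into the next style page until the list is exhausted
        yield response.follow(remaining.pop(0), callback=self.parse_distances,
                              meta={'swimmer': swimmer, 'remaining': remaining})
    else:
        # Every style page has been visited: emit the swimmer exactly once
        yield swimmer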
I have been working on a Scrapy spider that is now capable of scraping tripadvisor.com reviews. However, to extract the trip type I need to get my spider to press 'read more...', otherwise that part of the markup does not show up. For the first page (the first 5 reviews) the trip types are still hidden in the code, but on the next review page the markup disappears completely and the spider only returns N/A. I have seen some previous questions answered with Selenium, but after trying I haven't been able to get it working. My code is as follows (without any Selenium code). Thanks in advance!
# -*- coding: utf-8 -*-
import scrapy
import lxml.html
class TripreviewsSpider(scrapy.Spider):
name = 'tripreviews'
allowed_domains = ['tripadvisor.com']
start_urls = ['https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html']
### Listing page crawling
def parse_listing(self, response):
urls = response.css('a.property_title.prominent::attr(href)').extract()
for urlHotel in urls:
print(urlHotel)
urlHotel = response.urljoin(urlHotel)
yield scrapy.Request(url = urlHotel ,callback=self.parse) ###If scrapy recognises a detail page it opens it and will do a callback to line 23 (parse_review).
next_listing_url = response.css('a.nav.next.ui_button.primary.cx_brand_refresh_phase2::attr(href)').extract() ### If scrapy does not recognise a detail page, it will go to the next listing page and starts at line 11 again.
next_listing_url = response.urljoin(next_listing_url)
yield scrapy.Request(url=next_listing_url, callback=self.parse_listing)
def parse(self, response):
self.log('I just visited:' + response.url)
#Set up default value
defaultValue = "N/A"
#Narrow review dataset
foundReviews = response.css('div[data-test-target="reviews-tab"] > div[data-test-target="HR_CC_CARD"]')
#Narrow hoteldetails dataset
foundHotelDetails = response.css("div[class*='ui_column is-12-tablet is-9-mobile hotels-hotel-review-atf-info-parts-ATFInfo__description--']")
#listingName
listingName = []
for hotel in foundHotelDetails:
containslistingName = hotel.css("div[class*='ui_column is-12-tablet is-9-mobile hotels-hotel-review-atf-info-parts-ATFInfo__description--'] > div > h1#HEADING::text").extract()
listingName.append(containslistingName or defaultValue)
listingRating = []
for ratingAverage in foundHotelDetails.css("a[class*='hotels-hotel-review-atf-info-parts-Rating__ratingsAnchor--'] > span[class*='bubble']").xpath("@class").extract():
#For each found rating: strip everything off (extra classes and the prefix 'bubble_') to get the rating value.
#Divide it by 10 to get the proper rating.
listingRating.append( int(ratingAverage.split(" ")[1].split("bubble_")[1]) / 10 )
listingCategory = []
for category in foundHotelDetails:
containsCategory = category.css("div[class*='hotels-hotel-review-atf-info-parts-PopIndex__popIndex--'] > span > a::text").get()
listingCategory.append(containsCategory.split(" ")[0] or defaultValue)
#Build list of review attributes using narrowed dataset
reviewTitle = foundReviews.css('div[data-test-target="review-title"] > a > span > span::text').extract()
#Clean up html tags in review text
formattedReviewText = []
for text in foundReviews:
#For each entry: check if it has an element containing the trip type
foundText = str(text.css("q[class*='location-review-review-list-parts-ExpandableReview__reviewText']").get())
formattedText = lxml.html.fromstring(foundText).text_content()
#Add the type if it was found in text, or add the default value
formattedReviewText.append(formattedText or defaultValue)
reviewDateofStay = foundReviews.css("span[class*='location-review-review-list-parts-EventDate__event_date']::text").extract()
#Because the origin country of the reviewer isn't always present, create a specific list.
#This list makes sure that if trip type is present, it will be added to the corresponding review.
#Otherwise it's defaultValue
#That way the list lengths will always match.
reviewLocation = []
for location in foundReviews:
#For each entry: check if it has an element containing the trip type
containsReviewerOrigin = location.css("span[class*='default social-member-common-MemberHometown__hometown']::text").get()
#Add the type if it was found in text, or add the default value
reviewLocation.append(containsReviewerOrigin or defaultValue)
#Because ratings are immediately turned into a css class, you cannot extract this immediately.
#You need to strip the exact class out and then divide by 10 so you can get the rating in a scale from 1.0 to 5.0.
#Ratings are also optional....
reviewRating = []
for rating in foundReviews.css("div[data-test-target='review-rating'] > span[class*='bubble']").xpath("@class").extract():
#For each found rating: strip everything off (extra classes and the prefix 'bubble_') to get the rating value.
#Divide it by 10 to get the proper rating.
reviewRating.append( int(rating.split(" ")[1].split("bubble_")[1]) / 10 )
#Because trip type isn't always present, create a specific list.
#This list makes sure that if trip type is present, it will be added to the corresponding review.
#Otherwise it's defaultValue
#That way the list lengths will always match.
reviewTripType = []
for review in foundReviews:
#For each entry: check if it has an element containing the trip type
containsTripType = review.css("span[class*='location-review-review-list-parts-TripType__trip_type--']::text").get()
#Add the type if it was found in text, or add the default value
reviewTripType.append(containsTripType or defaultValue)
reviewHelpfulVotes = []
for helpfulVotes in foundReviews:
#For each entry: check if it has an element containing the trip type
hasHelpfulVotes = helpfulVotes.css("div[class*='location-review-social-bar-SocialBar__bar--'] > span[class*='social-statistics-bar-SocialStatisticsBar__bar--'] > span[class*='social-statistics-bar-SocialStatisticsBar__counts']::text").get()
#Add the type if it was found in text, or add the default value. Check if hasHelpfulVotes isn't Nonetype.
#And split value on first space and get first element because default value does not contain spaces it will never look weird and will always make sure that validHelpfulVotes values are always split correctly.
validHelpfulVote = hasHelpfulVotes or defaultValue
reviewHelpfulVotes.append(validHelpfulVote.split(" ")[0])
reviewPrefix = {
'name': listingName,
'listing Rating': listingRating,
'category': listingCategory
}
#Create the review data
for review in zip(reviewLocation, reviewTitle, formattedReviewText, reviewRating, reviewDateofStay, reviewTripType, reviewHelpfulVotes):
reviewData = {
'Origin of reviewer': review[0],
'Title': review[1],
'ReviewText': review[2],
'Rating': review[3] or defaultValue,
'Date of stay': review[4] or defaultValue,
'Trip type': review[5],
'Helpful votes': review[6]
}
reviewTotal = {**reviewPrefix,**reviewData}
yield reviewTotal
### Next review page
nextReviewPage = response.css('a.ui_button.nav.next.primary::attr(href)').extract_first()
if nextReviewPage:
nextReviewPage = response.urljoin(nextReviewPage)
print(nextReviewPage) #For checking purposes
yield scrapy.Request(url = nextReviewPage, callback = self.parse)
yield scrapy.Request(url = 'https://www.tripadvisor.com/Hotels-g60763-New_York_City_New_York-Hotels.html', callback = self.parse_listing)
So I have Scrapy working really well. It's grabbing data out of a page, but the problem I'm running into is that sometimes the page's table order is different. For example, the first page it gets to:
Row name Data
Name 1 data 1
Name 2 data 2
The next page it crawls to might have a completely different order: where Name 1 was the first row, on any other page it might be the 3rd, or 4th, etc. The row names are always the same. I was thinking of doing this in one of two ways; I'm not sure which will work or which is better.
First option: use some if statements to find the row I need, and then grab the following column. This seems a little messy but could work.
Second option: grab all the data in the table regardless of order and put it in a dict. This way I can grab the data I need based on the row name. This seems like the cleanest approach.
Is there a third option, or a better way of doing either?
Here's my code in case it's helpful.
class pageSpider(Spider):
name = "pageSpider"
allowed_domains = ["domain.com"]
start_urls = [
"http://domain.com/stuffs/results",
]
visitedURLs = Set()
def parse(self, response):
products = Selector(response).xpath('//*[@class="itemCell"]')
for product in products:
item = PageScraper()
item['url'] = product.xpath('div[2]/div/a/@href').extract()[0]
urls = Set([product.xpath('div[2]/div/a/@href').extract()[0]])
print urls
for url in urls:
if url not in self.visitedURLs:
request = Request(url, callback=self.productpage)
request.meta['item'] = item
yield request
def productpage(self, response):
specs = Selector(response).xpath('//*[@id="Specs"]')
item = response.meta['item']
for spec in specs:
item['make'] = spec.xpath('fieldset[1]/dl[1]/dd/text()').extract()[0].encode('utf-8', 'ignore')
item['model'] = spec.xpath('fieldset[1]/dl[4]/dd/text()').extract()[0].encode('utf-8', 'ignore')
item['price'] = spec.xpath('fieldset[2]/dl/dd/text()').extract()[0].encode('utf-8', 'ignore')
yield item
The XPaths in productpage can pick up data that doesn't correspond to what I need, because the order changed.
Edit:
I'm trying the dict approach and I think this is the best option.
def productpage(self, response):
specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
itemdict = {}
for i in specs:
test = i.xpath('dl')
for t in test:
itemdict[t.xpath('dt/text()').extract()[0]] = t.xpath('dd/text()').extract()[0]
item = response.meta['item']
item['make'] = itemdict['Brand']
yield item
This seems like the best and cleanest approach (using a dict):
def productpage(self, response):
specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
itemdict = {}
for i in specs:
test = i.xpath('dl')
for t in test:
itemdict[t.xpath('dt/text()').extract()[0]] = t.xpath('dd/text()').extract()[0]
item = response.meta['item']
item['make'] = itemdict['Brand']
yield item
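One small optional refinement, not part of the original answer: use dict.get with a default so a page that is missing a given row doesn't raise a KeyError.

# Fall back to 'N/A' when a row name is absent on this particular page;
# 'Brand' comes from the answer above, 'N/A' is just a placeholder default.
item['make'] = itemdict.get('Brand', 'N/A')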