I'm creating a spider using Scrapy to scrape details from rottentomatoes.com. As the search page is rendered dynamically, I used the Rotten Tomatoes API, e.g. https://www.rottentomatoes.com/api/private/v2.0/search?q=inception, to get the search results and URLs. Following the URLs via Scrapy, I was able to extract the tomatometer score, audience score, director, cast, etc. However, I want to extract all the audience reviews too. The issue is that the audience reviews page (https://www.rottentomatoes.com/m/inception/reviews?type=user) uses pagination, and I'm not able to extract data from the next page; moreover, I couldn't find a way to use the API to extract those details either. Could anyone help me with this?
def parseRottenDetail(self, response):
    print("Reached Tomato Parser")
    try:
        if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
            items = TomatoCrawlerItem()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['tomatometerScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half .mop-ratings-wrap__percentage::text').get().strip()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][
                'tomatoAudienceScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half.audience-score .mop-ratings-wrap__percentage::text').get().strip()
            MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][
                'tomatoCriticConsensus'] = response.css('p.mop-ratings-wrap__text--concensus::text').get()
            if MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["type"] == "Movie":
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                    "//ul[@class='content-meta info']/li[@class='meta-row clearfix']/div[contains(text(),'Directed By')]/../div[@class='meta-value']/a/text()").get()
            else:
                MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]['Director'] = response.xpath(
                    "//div[@class='tv-series__series-info-castCrew']/div/span[contains(text(),'Creator')]/../a/text()").get()
            reviews_page = response.css('div.mop-audience-reviews__view-all a[href*="reviews"]::attr(href)').get()
            if len(reviews_page) != 0:
                yield response.follow(reviews_page, callback=self.parseRottenReviews)
            else:
                for key in MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse].keys():
                    if "pageURL" not in key and "type" not in key:
                        items[key] = MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse][key]
                yield items
                if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
                    MoviecrawlSpider.current_parse += 1
                    print("Parse Values are Current Parse " + str(
                        MoviecrawlSpider.current_parse) + " and Total Results " + str(MoviecrawlSpider.total_results))
                    yield response.follow(MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["pageURL"],
                                          callback=self.parseRottenDetail)
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        print(e)
        print(exc_tb.tb_lineno)
After this piece of code is executed, I reach the reviews page, e.g. https://www.rottentomatoes.com/m/inception/reviews?type=user. From there, a Next button loads the next page via pagination. So what should my approach be to extract all the reviews?
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    items = TomatoCrawlerItem()
When you go to the next page, you can see that it reuses the endCursor value from the previous page. You can set endCursor to an empty string for the first iteration. Also note that you need the movieId to request reviews; this id can be extracted from the JSON embedded in the page's JS:
import requests
import re
import json

r = requests.get("https://www.rottentomatoes.com/m/inception/reviews?type=user")
data = json.loads(re.search(r'movieReview\s=\s(.*);', r.text).group(1))
movieId = data["movieId"]

def getReviews(endCursor):
    r = requests.get(f"https://www.rottentomatoes.com/napi/movie/{movieId}/reviews/user",
        params = {
            "direction": "next",
            "endCursor": endCursor,
            "startCursor": ""
        })
    return r.json()

reviews = []
result = {}

for i in range(0, 5):
    print(f"[{i}] request review")
    result = getReviews(result["pageInfo"]["endCursor"] if i != 0 else "")
    reviews.extend([t for t in result["reviews"]])

print(reviews)
print(f"got {len(reviews)} reviews")
Note that you can also scrape the HTML for the first iteration.
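If you'd rather not hard-code five iterations, a small variation is to keep paging until the API says there is no next page. This is just a sketch reusing the getReviews helper above and the pageInfo.hasNextPage flag that the Scrapy version below also relies on:

# Page until the API reports no further pages.
reviews = []
end_cursor = ""
while True:
    result = getReviews(end_cursor)
    reviews.extend(result["reviews"])
    if not result["pageInfo"]["hasNextPage"]:
        break
    end_cursor = result["pageInfo"]["endCursor"]

print(f"got {len(reviews)} reviews")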
As I'm using Scrapy, I was looking for a way to do this without the requests module. The approach is the same, but I found that the page https://www.rottentomatoes.com/m/inception has an object root.RottenTomatoes.context.fandangoData in a <script> tag, whose "emsId" key holds the ID of the movie that is passed to the API to get details. Going through the network tab on each pagination event, I realised that startCursor and endCursor are used to filter the results for each page.
pattern = r'\broot.RottenTomatoes.context.fandangoData\s*=\s*(\{.*?\})\s*;\s*\n'
json_data = response.css('script::text').re_first(pattern)
movie_id = json.loads(json_data)["emsId"]
{SpiderClass}.movieId = movie_id
next_page = "https://www.rottentomatoes.com/napi/movie/" + movie_id + "/reviews/user?direction=next&endCursor=&startCursor="
yield response.follow(next_page, callback=self.parseRottenReviews)
For the first iteration, you can leave the startCursor and endCursor parameters blank. Then, inside the parse callback, you can take the startCursor and endCursor of the next page from the current response, and repeat this until the hasNextPage attribute is false.
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    current_result = json.loads(response.text)
    for review in current_result["reviews"]:
        {SpiderClass}.reviews.append(review)  # spider class member, so it can be shared among iterations
    if current_result["pageInfo"]["hasNextPage"] is True:
        next_page = "https://www.rottentomatoes.com/napi/movie/" + \
                    str({SpiderClass}.movieId) + "/reviews/user?direction=next&endCursor=" + str(
                        current_result["pageInfo"]["endCursor"]) + "&startCursor=" + str(current_result["pageInfo"]["startCursor"])
        yield response.follow(next_page, callback=self.parseRottenReviews)
Now the {SpiderClass}.reviews list will hold all the reviews.
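If you also want to persist that list once the crawl finishes, Scrapy calls a spider's closed() method (if defined) at shutdown. A minimal sketch, assuming the reviews list built above and a placeholder output filename:

# Inside the spider class; json is already imported at module level.
def closed(self, reason):
    # Called by Scrapy when the spider finishes; dump the accumulated reviews
    # (the same list referred to as {SpiderClass}.reviews above).
    # "audience_reviews.json" is just a placeholder filename.
    with open("audience_reviews.json", "w") as f:
        json.dump(self.__class__.reviews, f, indent=2)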
Related
How can I go about parsing data for one variable directly from the start url and data for other variables after following all the href from the start url?
The web page I want to scrape has a list of articles with "category", "title", "content", "author" and "date" data. In order to scrape the data, I followed all the "href"s on the start url, which redirect to the full articles, and parsed the data there. However, the "category" data is not always available when an individual article is opened/followed from its "href", so I end up with missing data for some observations. Now I'm trying to scrape just the "category" data directly from the start url, which has the "category" data for all article listings (no missing data). How should I go about parsing the "category" data? How should I take care of the parsing and callback? The "category" data is circled in red in the image.
class BtcNewsletterSpider(scrapy.Spider):
    name = 'btc_spider'
    allowed_domains = ['www.coindesk.com']
    start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']

    def parse(self, response):
        for link in response.css('.card-title'):
            yield response.follow(link, callback=self.parse_newsletter)

    def parse_newsletter(self, response):
        item = CoindeskItem()
        item['category'] = response.css('.kjyoaM::text').get()
        item['headline'] = response.css('.fPbJUO::text').get()
        item['content_summary'] = response.css('.jPQVef::text').get()
        item['authors'] = response.css('.dWocII a::text').getall()
        item['published_date'] = response.css(
            '.label-with-icon .fUOSEs::text').get()
        yield item
You can use the cb_kwargs argument to pass data from one parse callback to another. To do this you need to grab the category value for the corresponding link to the full article, which you can do by iterating over an element that contains both the category and the link and pulling the information for each out of that element.
Here is an example based on the code you provided, this should work the way you described.
class BtcNewsletterSpider(scrapy.Spider):
    name = 'btc_spider'
    allowed_domains = ['www.coindesk.com']
    start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']

    def parse(self, response):
        for card in response.xpath("//div[contains(@class,'articleTextSection')]"):
            item = CoindeskItem()
            item["category"] = card.xpath(".//a[@class='category']//text()").get()
            link = card.xpath(".//a[@class='card-title']/@href").get()
            yield response.follow(
                link,
                callback=self.parse_newsletter,
                cb_kwargs={"item": item}
            )

    def parse_newsletter(self, response, item):
        item['headline'] = response.css('.fPbJUO::text').get()
        item['content_summary'] = response.css('.jPQVef::text').get()
        item['authors'] = response.css('.dWocII a::text').getall()
        item['published_date'] = response.css(
            '.label-with-icon .fUOSEs::text').get()
        yield item
I've got the following code that filters a particular search on an auction site.
I can display the titles of each value & also the len of all returned values:
from bs4 import BeautifulSoup
import requests

url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")

listings = soup.findAll("div", attrs={"class": "tm-marketplace-search-card__title"})
print(len(listings))
for listing in listings:
    print(listing.text)
This prints out the following:
#print(len(listings))
3
#for listing in listings:
# print(listing.text)
PRS. Ten Top Custom 24, faded Denim, Piezo.
PRS SE CUSTOM 22
PRS Tremonti SE *With Seymour Duncan Pickups*
I know what I want to do next but don't know how to code it. Basically, I want to display only new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing it against the len from another GET request (a second variable) that maybe runs first thing in the morning. Alternatively, I could compare the text values instead of the len. If they don't match, the new listings are shown. Is there a better or different way to do this? Any help appreciated, thank you.
With length-comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even if there are; and text-comparison does not account for results with similar titles.
I can suggest 3 other methods. (The 3rd uses my preferred approach.)
Closing time
A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing
from datetime import date
import dateutil.parser

def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid': 'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days
and then filter by the returned value
min_dtc = 3 # or as preferred
# your current code up to listings = soup.findAll....
new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing.text)
However, I don't know if sellers are allowed to set their own closing times or if they're set at a fixed offset; also, I don't see the closing time text when inspecting with the browser dev tools [even though I could extract it with the code above], and that makes me a bit unsure of whether it's always available.
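Since that availability is uncertain, a defensive variant can simply keep any listing whose closing-time tag isn't found instead of crashing. This is just a sketch wrapping the same helper; nothing else is assumed:

# Treat listings with no parsable closing time as "keep" rather than erroring out.
def get_days_til_closing_safe(lSoup, default=999):
    try:
        return get_days_til_closing(lSoup)
    except (AttributeError, ValueError):
        # previous_sibling/find returned None, or the text didn't parse as a date
        return default

new_listings = [l for l in listings if get_days_til_closing_safe(l) > min_dtc]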
JSON list of Listing IDs
Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save that in a list as a JSON file and keep checking against it every new scrape
from bs4 import BeautifulSoup
import requests
import json

lFilename = 'listing_ids.json'  # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")

try:
    prev_listings = json.load(open(lFilename, 'r'))
except Exception as e:
    prev_listings = []
print(len(prev_listings), 'saved listings found')

soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")
new_listings = [
    l for l in listings if
    l.get('href').split('/listing/')[1].split('?')[0]
    not in prev_listings
]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)

with open(lFilename, 'w') as f:
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ], f)
This should be fairly reliable as long as they don't tend to recycle the listing IDs. (Even then, every once in a while, after checking the new listings for that day, you can just delete the JSON file and re-run the program once; that will also keep the file from getting too big...)
CSV Logging [including Listing IDs]
Instead of just saving the IDs, you can save pretty much all the details from each result
from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas

lFilename = 'listings.csv'  # or as preferred
max_days = 60  # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")

try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')

def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}
    for k, sel in selectorsRef.items():
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text
    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()
    return lDets

soup = BeautifulSoup(url.text, "html.parser")
listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]
new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing['title'])

prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)
You'll end up with a spreadsheet of scraping history/log that you can check anytime, and depending on what you set max_days to, the oldest data will be automatically cleared.
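To look back at that history later, you can simply load the CSV with pandas. A small sketch, assuming the listings.csv written above:

import pandas

log = pandas.read_csv('listings.csv')
latest = log['last_scraped'].max()
# titles of listings that were flagged as new on the most recent scrape
print(log.loc[(log['last_scraped'] == latest) & (log['new_listing']), 'title'])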
Fixed it with the following:
allGuitars = ["", ]
latestGuitar = soup.select("#-title")[0].text.strip()

if latestGuitar in allGuitars[0]:
    print("No change. The latest listing is still: " + allGuitars[0])
elif latestGuitar not in allGuitars[0]:
    print("New listing detected! - " + latestGuitar)
    allGuitars.clear()
    allGuitars.insert(0, latestGuitar)
I am trying to find a way to loop through URLs and scrape paginated tables in each of them. The issue arises when some URLs have differing page numbers (in some cases there is no table!). Can someone explain to me where I went wrong and how to fix this? (Please let me know if you require further info.)
def get_injuries(pages):
    Injuries_list = []
    for page in range(1, pages + 1):
        for player_id in range(1, 10):
            headers = {"User-Agent": "Mozilla/5.0"}
            url = 'https://www.transfermarkt.co.uk/neymar/verletzungen/spieler/' + str(player_id)
            print(url)
            html = requests.get(url, headers=headers)
            soup = bs(html.content)
            # Select first table
            if soup.select('.responsive-table > .grid-view > .items > tbody'):
                soup = soup.select('.responsive-table > .grid-view > .items > tbody')[0]
            try:
                for cells in soup.find_all(True, {"class": re.compile("^(even|odd)$")}):
                    Season = cells.find_all('td')[1].text
                    Strain = cells.find_all('td')[2].text
                    Injury_From = cells.find_all('td')[3].text
                    Injury_To = cells.find_all('td')[4].text
                    Duration_days = cells.find_all('td')[5].text
                    Games_missed = cells.find_all('td')[6].text
                    Club_affected = cells.find_all('td')[6].img['alt']
                    player = {
                        'name': cells.find_all("h1", {"itemprop": "name"}),
                        'Season': Season,
                        'Strain': Strain,
                        'Injury_from': Injury_From,
                        'Injury_To': Injury_To,
                        'Duration (days)': Duration_days,
                        'Games_Missed': Games_missed,
                        'Club_Affected': Club_affected
                    }
                    players_list.append(player)
            except IndexError:
                pass
            return Injuries_list
    return Injuries_list
A few things to fix:
The return Injuries_list sits inside the inner for loop, so after one pass it returns having processed only one URL; it should come after the loops finish (and the duplicate return can go).
Only one for loop is sufficient to get the data, since page is never actually used in the URL.
players_list = [] is never defined; create one at the start, or simply append to Injuries_list instead.
You are not doing anything with Injuries_list, so the function returns an empty list.
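Putting those points together, a corrected version could look roughly like the sketch below (selectors and columns are kept from the question, the broken name lookup is dropped, and the pagination loop is omitted since the page number was never used; treat the structure, not the details, as the point):

import re
import requests
from bs4 import BeautifulSoup as bs

def get_injuries(max_player_id=10):
    injuries_list = []
    headers = {"User-Agent": "Mozilla/5.0"}
    for player_id in range(1, max_player_id):
        url = 'https://www.transfermarkt.co.uk/neymar/verletzungen/spieler/' + str(player_id)
        html = requests.get(url, headers=headers)
        soup = bs(html.content, 'html.parser')
        tables = soup.select('.responsive-table > .grid-view > .items > tbody')
        if not tables:  # some players have no injury table at all
            continue
        for row in tables[0].find_all(True, {"class": re.compile("^(even|odd)$")}):
            cells = row.find_all('td')
            try:
                injuries_list.append({
                    'Season': cells[1].text,
                    'Strain': cells[2].text,
                    'Injury_from': cells[3].text,
                    'Injury_To': cells[4].text,
                    'Duration (days)': cells[5].text,
                    'Games_Missed': cells[6].text,
                    'Club_Affected': cells[6].img['alt'] if cells[6].img else None,
                })
            except IndexError:
                pass  # skip rows that don't have the expected columns
    return injuries_list  # single return, outside all loops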
I am using a tutorial I found (link) to scrape news sites using the Python libraries newspaper and feedparser.
It reads the links to process from a JSON file and then gets the articles from them. The issue was that it could only get articles from the first page and didn't iterate to the 2nd, 3rd and so on. So I wrote a script to populate the JSON file with the first 50 pages of a site, e.g. www.site.com/page/x.
{
"site0" : { "link" : "https://sitedotcom/page/0/"},
"site1" : { "link" : "https://sitedotcom/page/1/"},
"site2" : { "link" : "https://sitedotcom/page/2/"}
etc
}
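For illustration, a population script along these lines would produce that file. This is only a sketch, not the actual script used; the base URL, page count, and filename are placeholders:

import json

base = "https://sitedotcom/page/{}/"  # placeholder base URL
pages = 50
sites = {f"site{i}": {"link": base.format(i)} for i in range(pages)}

with open("thingie2.json", "w") as f:
    json.dump(sites, f, indent=4)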
# Imports needed by the snippet below
import csv
import json
from datetime import datetime
from time import mktime

import feedparser as fp
import newspaper
from newspaper import Article

# Set the limit for number of articles to download
LIMIT = 1000000000

articles_array = []
data = {}
data['newspapers'] = {}

# Loads the JSON file with news sites
with open('thingie2.json') as data_file:
    companies = json.load(data_file)

count = 1

# Iterate through each news company
for company, value in companies.items():
    # If a RSS link is provided in the JSON file, this will be the first choice.
    # Reason for this is that RSS feeds often give more consistent and correct data.
    # RSS (Rich Site Summary; originally RDF Site Summary; often called Really Simple Syndication)
    # is a type of web feed which allows users to access updates to online content in a
    # standardized, computer-readable format.
    # If you do not want to scrape from the RSS feed, just leave the RSS attr empty in the JSON file.
    if 'rss' in value:
        d = fp.parse(value['rss'])
        print("Downloading articles from ", company)
        newsPaper = {
            "rss": value['rss'],
            "link": value['link'],
            "articles": []
        }
        for entry in d.entries:
            # Check if a publish date is provided; if not, the article is skipped.
            # This is done to keep consistency in the data and to keep the script from crashing.
            if hasattr(entry, 'published'):
                if count > LIMIT:
                    break
                article = {}
                article['link'] = entry.link
                date = entry.published_parsed
                article['published'] = datetime.fromtimestamp(mktime(date)).isoformat()
                try:
                    content = Article(entry.link)
                    content.download()
                    content.parse()
                except Exception as e:
                    # If the download for some reason fails (ex. 404) the script will continue
                    # with the next article.
                    print(e)
                    print("continuing...")
                    continue
                article['title'] = content.title
                article['text'] = content.text
                article['authors'] = content.authors
                article['top_image'] = content.top_image
                article['movies'] = content.movies
                newsPaper['articles'].append(article)
                articles_array.append(article)
                print(count, "articles downloaded from", company, ", url: ", entry.link)
                count = count + 1
    else:
        # This is the fallback method if a RSS-feed link is not provided.
        # It uses the python newspaper library to extract articles.
        print("Building site for ", company)
        paper = newspaper.build(value['link'], memoize_articles=False)
        newsPaper = {
            "link": value['link'],
            "articles": []
        }
        noneTypeCount = 0
        for content in paper.articles:
            if count > LIMIT:
                break
            try:
                content.download()
                content.parse()
            except Exception as e:
                print(e)
                print("continuing...")
                continue
            # Again, for consistency, if there is no found publish date the article will be skipped.
            # After 10 downloaded articles from the same newspaper without publish date, the company will be skipped.
            article = {}
            article['title'] = content.title
            article['authors'] = content.authors
            article['text'] = content.text
            article['top_image'] = content.top_image
            article['movies'] = content.movies
            article['link'] = content.url
            article['published'] = content.publish_date
            newsPaper['articles'].append(article)
            articles_array.append(article)
            print(count, "articles downloaded from", company, " using newspaper, url: ", content.url)
            count = count + 1
            #noneTypeCount = 0
    count = 1
    data['newspapers'][company] = newsPaper

# Finally it saves the articles as a CSV file.
try:
    f = csv.writer(open('Scraped_data_news_output2.csv', 'w', encoding='utf-8'))
    f.writerow(['Title', 'Authors', 'Text', 'Image', 'Videos', 'Link', 'Published_Date'])
    #print(article)
    for artist_name in articles_array:
        title = artist_name['title']
        authors = artist_name['authors']
        text = artist_name['text']
        image = artist_name['top_image']
        video = artist_name['movies']
        link = artist_name['link']
        publish_date = artist_name['published']
        # Add each article's fields as a row
        f.writerow([title, authors, text, image, video, link, publish_date])
except Exception as e:
    print(e)
Navigating to these pages in my browser brings up older, unique articles as expected. But when I run the script on them, it returns the same articles no matter the page number. Is there something I have done wrong or not considered?
I'm trying to write a crawler which does the following job:
Start from a specific URL, then crawl the pages up to a depth limit (a simple Scrapy task). Along the way, check the first 1 kB of each linked resource and detect the file type with Magic. If it is a file rather than an HTML page, I want to send it to a pipeline for further processing.
What I have done so far is:
def process_links(self, links):
    for link in links:
        try:
            req = urlopen(link.url)
        except Exception as e:
            self.logger.error("--> Exception URL[%s]: %s", link, e)
            req = None
        if req is not None:
            ext_type = magic.from_buffer(req.read(1024))
            if 'HTML' in ext_type:
                yield link
            else:
                item = FileItem()
                item["type"] = "FILE"
                item["url"] = link.url
                item["extension"] = ext_type
                itemproc = self.crawler.engine.scraper.itemproc
                itemproc.process_item(item, self)
So I basically get the URLs from the scraped page, then depending on their type, either I yield a new Request or I send a new FileItem item to the pipeline manually (with the last two lines)
I also have parse_page function for parsing HTML pages:
def parse_page(self, response):
    item = PageItem()
    item["type"] = "PAGE"
    item["url"] = response.request.url
    item["depth"] = response.meta['depth']
    item["response_url"] = response.url
    item["status_code"] = response.status
    item["status_msg"] = "OK"
    item["request_headers"] = response.request.headers
    item["body"] = response.body
    item["links"] = []
    yield item
Which yields the PageItem object.
Although the yielded PageItem fires the item_passed signal in Scrapy, manually passed FileItems do not fire the signal. Am I missing something while manually passing the FileItems to the pipeline?
Thanks in advance!