Scrapy Pull Same Data from Multiple Pages - python

This is related to the previous question I wrote here. I am trying to pull the same data from multiple pages on the same domain. A quick explanation: I'm trying to pull data such as offensive yards, turnovers, etc. from a number of box scores linked from a main page. Pulling the data from individual pages works properly, as does generating the URLs, but when I have the spider cycle through all of the pages, nothing is returned. I've looked through many other questions people have asked and through the documentation, and I can't figure out what is not working. The code is below. Thanks in advance to anyone who's able to help.
import scrapy
from scrapy import Selector

from nflscraper.items import NflscraperItem


class NFLScraperSpider(scrapy.Spider):
    name = "pfr"
    allowed_domains = ['www.pro-football-reference.com/']
    start_urls = [
        "http://www.pro-football-reference.com/years/2015/games.htm"
        #"http://www.pro-football-reference.com/boxscores/201510110tam.htm"
    ]

    def parse(self, response):
        # Follow every "boxscore" link on the schedule page
        for href in response.xpath('//a[contains(text(),"boxscore")]/@href'):
            item = NflscraperItem()
            url = response.urljoin(href.extract())
            request = scrapy.Request(url, callback=self.parse_dir_contents)
            request.meta['item'] = item
            yield request

    def parse_dir_contents(self, response):
        item = response.meta['item']
        # Code to pull out the JS comment - https://stackoverflow.com/questions/38781357/pro-football-reference-team-stats-xpath/38781659#38781659
        extracted_text = response.xpath('//div[@id="all_team_stats"]//comment()').extract()[0]
        new_selector = Selector(text=extracted_text[4:-3].strip())
        # Item population
        item['home_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[2]/td[last()]/text()').extract()[0].strip()
        item['away_score'] = response.xpath('//*[@id="content"]/table/tbody/tr[1]/td[last()]/text()').extract()[0].strip()
        item['home_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[2]/text()').extract()[0].strip()
        item['away_oyds'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[6]/td[1]/text()').extract()[0].strip()
        item['home_dyds'] = item['away_oyds']
        item['away_dyds'] = item['home_oyds']
        item['home_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[2]/text()').extract()[0].strip()
        item['away_turn'] = new_selector.xpath('//*[@id="team_stats"]/tbody/tr[8]/td[1]/text()').extract()[0].strip()
        yield item

The subsequent requests you make are being filtered as offsite; fix your allowed_domains setting:
allowed_domains = ['pro-football-reference.com']
Worked for me.
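For context, Scrapy's OffsiteMiddleware treats each allowed_domains entry as a bare domain and also accepts its subdomains, so the trailing slash in the original value keeps www.pro-football-reference.com from matching and the box-score requests get dropped. A minimal sketch of the corrected spider header, reusing only the names from the question's code:

import scrapy


class NFLScraperSpider(scrapy.Spider):
    name = "pfr"
    # Bare registered domain: 'www.pro-football-reference.com' still matches as a subdomain,
    # while the old value with a trailing slash was filtered as offsite.
    allowed_domains = ['pro-football-reference.com']
    start_urls = ["http://www.pro-football-reference.com/years/2015/games.htm"]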

Related

My Scrapy pagination works but it just shows the first page's data for all pages

I'm new to Python and I mixed together many example codes for my Scrapy spider, but it just shows me the first page's data for all pages. What is the problem?
My code is:
import scrapy
from scrapy.item import Item, Field


class HotelAbbasiItem(Item):
    reviewer = Field()
    DateOfReview = Field()
    Nationality = Field()
    Contribution = Field()
    ReviewText = Field()
    Rating = Field()


class HotelabbasiSpider(scrapy.Spider):
    name = 'HotelAbbasi'
    allowed_domains = ['tripadvisor.com']
    start_urls = ['https://www.tripadvisor.com/Hotel_Review-g295423-d320767-Reviews-Abbasi_Hotel-Isfahan_Isfahan_Province.html']

    def parse(self, response):
        items = HotelAbbasiItem()
        all_div_parts = response.css('div.hotels-community-tab-common-Card__section--4r93H')
        for part in all_div_parts:
            reviewer = part.css('a.social-member-event-MemberEventOnObjectBlock__member--35-jC::text').extract()
            DateOfReview = part.css('span::text').extract()
            Nationality = part.css('span.small::text').extract()
            Contribution = part.css('span.social-member-MemberHeaderStats__bold--3z3qh::text').extract()
            ReviewText = part.css('q.location-review-review-list-parts-ExpandableReview__reviewText--gOmRC>span::text').extract()
            Rating = part.css('div.location-review-review-list-parts-RatingLine__bubbles--GcJvM>span::attr(class)').extract()
            items['reviewer'] = reviewer
            items['DateOfReview'] = DateOfReview
            items['Nationality'] = Nationality
            items['Contribution'] = Contribution
            items['ReviewText'] = ReviewText
            items['Rating'] = Rating
            yield items

        NextPage = response.css('div.is-centered>a.primary::attr(href)').extract_first()
        if NextPage:
            NextPage = response.urljoin(NextPage)
            yield scrapy.Request(url=NextPage, callback=self.parse)

How to deal with redirects to a bookmark within a page in Scrapy (911 error)

I am very new to programming, so apologies if this is a rookie issue. I am a researcher, and I've been building spiders to allow me to crawl specific search results of IGN, the gaming forum. The first spider collects each entry in the search results, along with URLs, and then the second spider crawls each of those URLs for the content.
The problem is that IGN redirects URLs associated with a specific post to a new URL that incorporates a #bookmark at the end of the address. This allows the visitor to the page to jump directly down to the post in question, but I want my spider to crawl over the entire thread. In addition, my spider ends up with a (911) error after the redirect and returns no data. The only data retrieved is from any search results that linked directly to a thread rather than a post.
I am absolutely stumped and confused, so any help would be amazing! Both spiders are attached below.
Spider 1:
import scrapy

myURLs = []
baselineURL = "https://www.ign.com/boards/search/186716896/?q=broforce&o=date&page="
for counter in range(1, 5):
    myURLs.append(baselineURL + str(counter))


class BroforceIGNScraper(scrapy.Spider):
    name = "foundation"
    start_urls = myURLs

    def parse(self, response):
        for post in response.css("div.main"):
            yield {
                'title': post.css("h3.title a::text").extract_first(),
                'author': post.css("div.meta a.username::text").extract_first(),
                'URL': post.css('h3 a').xpath('@href').extract_first(),
            }
Spider 2:
import csv

import scrapy

URLlist = []
baseURL = "https://www.ign.com/boards/"

with open('BroforceIGNbase.csv', 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        URLlist.append(baseURL + row['URL'])


class BroforceIGNScraper(scrapy.Spider):
    name = "posts2"
    start_urls = URLlist
    # handle_httpstatus_list = [301]

    def parse(self, response):
        for post in response.css(".messageList"):
            yield {
                'URL': response.url,
                'content': post.css(".messageContent article").extract_first(),
                'commentauthor': post.css("div.messageMeta a::text").extract_first(),
                'commentDateTime': post.css('div.messageMeta a span.DateTime').xpath('@title').extract_first(),
            }

Wrong XPath in IMDB Scrapy spider

Here:
IMDB scrapy get all movie data
response.xpath("//*[@class='results']/tr/td[3]")
returns an empty list. I tried to change it to:
response.xpath("//*[contains(@class,'chart full-width')]/tbody/tr")
without success.
Any help please? Thanks.
I did not have time to go through IMDB scrapy get all movie data thoroughly, but I got the gist of it. The problem statement is to get all movie data from the given site. It involves two things: first, go through all the pages that contain the list of all the movies of that year; second, get the link to each movie, and from there you do your own magic.
The problem you faced is with getting the XPath for the link to each movie. This is most likely due to a change in the website structure (I did not have time to verify exactly what changed). Anyway, the following are the XPath expressions you would require.
FIRST:
We take the div with class nav as a landmark and find the lister-page-next next-page class among its children.
response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first()
This gives the link for the next page, and returns None on the last page (since the next-page element is not present).
SECOND:
This addresses the OP's original question.
# Get the list of containers holding the title, etc.
list = response.xpath("//div[@class='lister-item-content']")
# From each container extract the required links
paths = list.xpath("h3[@class='lister-item-header']/a/@href").extract()
Now all you need to do is loop through each of these paths and request the corresponding page (see the sketch below).
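For illustration only, here is a minimal sketch that ties the two parts together with manual pagination instead of a CrawlSpider rule. The spider name, the single 2015 start URL, and the parse_movie_details stub are placeholders, and the class names in the selectors come from this answer, so they may have changed on the site since:

import scrapy


class IMDBListSketchSpider(scrapy.Spider):
    name = 'imdb_list_sketch'
    # One example year; the full spider would generate one such URL per year
    start_urls = ['http://www.imdb.com/search/title?year=2015,2015&title_type=feature&sort=num_votes,desc']

    def parse(self, response):
        # SECOND: follow the link of every movie listed on the current page
        paths = response.xpath("//div[@class='lister-item-content']"
                               "/h3[@class='lister-item-header']/a/@href").extract()
        for path in paths:
            yield scrapy.Request(response.urljoin(path), callback=self.parse_movie_details)

        # FIRST: follow the "next page" link, if present (None on the last page)
        next_page = response.xpath("//div[@class='nav']/div/a[@class='lister-page-next next-page']/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_movie_details(self, response):
        pass  # placeholder: per-movie parsing goes here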
Thanks for your answer. I eventually used your XPath like so:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from crawler.items import MovieItem

IMDB_URL = "http://imdb.com"


class IMDBSpider(CrawlSpider):
    name = 'imdb'
    # in order to move to the next page
    rules = (
        Rule(LinkExtractor(allow=(),
                           restrict_xpaths=("//div[@class='nav']/div/a[@class='lister-page-next next-page']",)),
             callback="parse_page", follow=True),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
        super(IMDBSpider, self).__init__(*args, **kwargs)
        self.start_year = int(start) if start else 1874
        self.end_year = int(end) if end else 2017

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year + 1):
            # movies are sorted by number of votes
            yield scrapy.Request('http://www.imdb.com/search/title?year={year},{year}&title_type=feature&sort=num_votes,desc'.format(year=year))

    def parse_page(self, response):
        content = response.xpath("//div[@class='lister-item-content']")
        # list of paths of movies on the current page
        paths = content.xpath("h3[@class='lister-item-header']/a/@href").extract()
        # all movies on this page
        for path in paths:
            item = MovieItem()
            item['MainPageUrl'] = IMDB_URL + path
            request = scrapy.Request(item['MainPageUrl'], callback=self.parse_movie_details)
            request.meta['item'] = item
            yield request

    # make sure that the start_urls are parsed as well
    parse_start_url = parse_page

    def parse_movie_details(self, response):
        pass  # lots of parsing....

Run it with: scrapy crawl imdb -a start=<start-year> -a end=<end-year>

Scrapy only scrapes the first start URL in a list of 15 start URLs

I am new to Scrapy and am attempting to teach myself the basics. I have put together code that goes to the Louisiana Department of Natural Resources website to retrieve the serial number for certain oil wells.
I have each well's link listed in start_urls, but Scrapy only downloads data from the first URL. What am I doing wrong?
import scrapy
from scrapy import Spider
from scrapy.selector import Selector

from mike.items import MikeItem


class SonrisSpider(Spider):
    name = "sspider"
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def parse(self, response):
        item = MikeItem()
        item['serial'] = response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
        yield item
Thank you for any help you might be able to provide. If I have not explained my problem thoroughly, please let me know and I will attempt to clarify.
I think this code might help.
By default Scrapy filters duplicate requests. Since only the query parameters differ between your start URLs, Scrapy may treat the rest of them as duplicates of the first one; that is why your spider stops after fetching the first URL. In order to parse the rest of the URLs, enable the dont_filter flag on the request (check start_requests() below).
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from mike.items import MikeItem


class SonrisSpider(scrapy.Spider):
    name = "sspider"
    allowed_domains = ["sonlite.dnr.state.la.us"]
    start_urls = [
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_data, dont_filter=True)

    def parse_data(self, response):
        item = MikeItem()
        serial = response.xpath(
            '/html/body/table[1]/tr[2]/td[1]/text()').extract()
        serial = serial[0] if serial else 'n/a'
        item['serial'] = serial
        yield item
Sample output returned by this spider is as follows:
{'serial': u'207899'}
{'serial': u'971683'}
{'serial': u'214206'}
{'serial': u'159420'}
{'serial': u'248942'}
{'serial': u'243671'}
Your code looks fine; try adding this function:
class SonrisSpider(Spider):
    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield self.make_requests_from_url(url)

    # the result of your code goes here
The URLs should be printed now. Test it, and if not, please say so.

Using Scrapy for XML page

I'm trying to scrape multiple pages from an API to practice and develop my XML scraping. One issue that has arisen is that when I try to scrape a document formatted like this: http://i.imgur.com/zJqeYvG.png and store it as XML, it fails to do so.
Within the CMD it fetches the URL and creates the XML file on my computer, but there's nothing in it.
How would I fix it to echo out the whole document, or even parts of it?
The code is below:
from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from doitapi.items import DoIt
import random


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["do-it.org.uk"]
    start_urls = []
    number = []
    for count in range(100):
        number.append(random.randint(2000000, 2500000))
    for i in number:
        start_urls.append("http://www.do-it.org.uk/syndication/opportunities/%d?apiKey=XXXXX-XXXX-XXX-XXX-XXXXX" % i)

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        titles = xxs.register_namespace("d", "http://www.do-it.org.uk/volunteering-opportunity")
        items = []
        for titles in titles:
            item = DoIt()
            item["url"] = response.url
            item["name"] = titles.select("//d:title").extract()
            item["description"] = titles.select("//d:description").extract()
            item["username"] = titles.select("//d:info-provider/name").extract()
            item["location"] = titles.select("//d:info-provider/address").extract()
            items.append(item)
        return items
Your XML file is using the namespace "http://www.do-it.org.uk/volunteering-opportunity", so to select title, name, etc. you have two choices:
either use xxs.remove_namespaces() once and then use .select("./title"), .select("./description") etc.,
or register the namespace once, with a prefix like "doit", i.e. xxs.register_namespace("doit", "http://www.do-it.org.uk/volunteering-opportunity"), and then use .select("./doit:title"), .select("./doit:description") etc.
A sketch of the first option is shown below. For more details on XML namespaces, see this page in the FAQ and this page in the docs.
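As a minimal sketch of the first option (remove_namespaces), reusing the question's DoIt item and the old BaseSpider/XmlXPathSelector API; the "//opportunity" record element is an assumption about the feed layout, so adjust it to whatever the actual XML uses:

def parse(self, response):
    xxs = XmlXPathSelector(response)
    # Strip the namespace once so plain element names can be used below
    xxs.remove_namespaces()
    items = []
    # NOTE: "//opportunity" is an assumed record element name - adjust it to the real feed
    for node in xxs.select("//opportunity"):
        item = DoIt()
        item["url"] = response.url
        item["name"] = node.select("./title/text()").extract()
        item["description"] = node.select("./description/text()").extract()
        item["username"] = node.select("./info-provider/name/text()").extract()
        item["location"] = node.select("./info-provider/address/text()").extract()
        items.append(item)
    return items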
