Scrapy returning too many rows in tables - Python

Feels like I'm not grasping some concepts here, or trying to fly before I can crawl (pun intended).
There are indeed 5 tables on the page, with the one I'm interested in being the 3rd. But executing this:
#!/usr/bin/python
# python 3.x
import sys
import os
import re
import requests
import scrapy

class iso3166_spider(scrapy.Spider):
    name = "countries"

    def start_requests(self):
        urls = ["https://en.wikipedia.org/wiki/ISO_3166-1"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.xpath('//title/text()').get()
        print("-- title -- {0}".format(title))

        list_table_selector = response.xpath('//table')  # gets all tables on the page
        print("-- table count -- {0}".format(len(list_table_selector)))

        table_selector = response.xpath('//table[2]')  # inspect to figure out which one you want
        table_selector_text = table_selector.getall()  # got the right table, starts with Afghanistan
        # print(table_selector_text)

        # here is where things go wrong
        list_row_selector = table_selector.xpath('//tr')
        print("number of rows in table: {0}".format(len(list_row_selector)))  # gives 302, should be close to 247

        for i in range(0, 20):
            row_selector = list_row_selector[i]
            row_selector_text = row_selector.getall()
            print("i={0}, getall:{1}".format(i, row_selector_text))
This prints the getall() of each row in EVERY table - I see the row for Afghanistan as row 8, not row 2.
Changing
list_row_selector = table_selector.xpath('//tr')
to
list_row_selector = table_selector.xpath('/tr')
results in zero rows found where I'd expect roughly 247
Ultimately I want the name and three codes for each country, should be straightforward.
What am I doing wrong?
TIA,
kerchunk

Try this XPath. Checking the page source, the header row ("th" elements) also sits under tbody:
tbl = response.xpath("//th[starts-with(text(),'English short name')]/ancestor::table/tbody/tr[position()>1]")
You can also try replacing "//tr" with ".//tr".
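The underlying issue is that an XPath beginning with // always searches from the document root, even when called on a table selector, so table_selector.xpath('//tr') returns the rows of every table on the page; a leading dot (.//tr) keeps the search relative to the selected table. Here is a minimal sketch of a corrected parse method, assuming the codes table is the third table on the page (confirm the index by inspecting the page):

    def parse(self, response):
        table_selector = response.xpath('(//table)[3]')  # third table in the whole document
        rows = table_selector.xpath('.//tr')             # '.' keeps the search relative to this table
        print("number of rows in table: {0}".format(len(rows)))
        for row in rows[1:]:                             # skip the header row
            cells = row.xpath('.//td//text()').getall()
            print(cells)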

Related

When accessing a class variable updated in a method, its updated value is not picked up in another method in scrapy spider

I am trying to parse a public forum that contains multiple threads, and I need to store metadata for each thread. That metadata appears before entering the thread, i.e. on the page that displays the list of discussion threads.
In my Scrapy code below, I need to access values from the parse() method inside the parse_contents() method. I am storing those values in class variables, but parse_contents() only ever picks up the value that was assigned the very first time, even though a new value has been assigned before parse_contents() is called.
Here is my spider class
import scrapy
import re
import pandas as pd
import time
from functools import reduce
from ..items import PostsItem

class SpiderSpider(scrapy.Spider):
    name = 'posts'
    page_count = 1
    forum_count = 0

    # Create an item container to store all this data
    post_item = PostsItem()

    # I want these variables available in the parse_contents() method
    post_subject_last_message_date = ""
    total_posts = 0

    start_urls = [
        # 'https://www.dcurbanmom.com/jforum/posts/list/150/946237.page'
        'https://www.dcurbanmom.com/jforum/forums/show/32.page'
    ]

    # Grabs the list of threads in the DCPS forum
    def parse(self, response):
        for next_forum in response.xpath('//span[@class="topictitle"]'):
            next_forum_link = next_forum.xpath('.//a/@href')
            next_forum_url = response.urljoin(next_forum_link.extract_first())
            last_message = next_forum.xpath('.//ancestor::td[1]/following-sibling::td[4]/span/text()')
            self.post_subject_last_message_date = last_message.get()  # This needs to be picked up by parse_contents
            yield scrapy.Request(url=next_forum_url, callback=self.parse_contents)
        # Get next page of discussion threads list
        # Some code here

    # Parses an individual discussion thread
    def parse_contents(self, response):
        all_posts = response.xpath('//table[@class="forumline"]//tr')
        post_text = ""
        for post in all_posts:
            post_text_response = post.xpath(".//div[@class='postbody']/br/following-sibling::text()[1] | .//div[@class='postbody']/br/following-sibling::a[1]/text() | .//div[@class='postbody']/text() | .//div[@class='postbody']/a/text()")
            if len(post_text_response.getall()) > 0:
                post_text = "".join(re.sub('\r', '', x) for x in post_text_response.getall()).strip()

            # Populate the item container
            if bool(re.search(r'^\s*$', post_text)) == False:
                self.post_item['post_message'] = post_text
                # !!! This is not picking up the value updated in the parse method !!!
                self.post_item['post_subject_last_message_date'] = self.post_subject_last_message_date
                post_text = ""
                yield self.post_item
        # Go to next page in this discussion thread
        # Some code here
How can I fix this?
Edit: removed some lines of code to make it easier to read
Replacing yield scrapy.Request(url=next_forum_url, callback=self.parse_contents) with the following fixed it for me:
yield scrapy.Request(url=next_forum_url, callback=self.parse_contents, cb_kwargs={
    'post_subject_answers': post_subject_answer,
    'post_subject_first_post_date': post_subject_first_post_date,
    'post_subject_views': post_subject_views,
    'post_subject_last_message_date': post_subject_last_message_date
})
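The class-variable approach fails because Scrapy schedules requests asynchronously: by the time parse_contents() runs, the shared attribute no longer holds the value that was current when that particular request was created. cb_kwargs (Scrapy 1.7+) passes a per-request copy of the value to the callback as a keyword argument, which the callback must accept. A minimal sketch assuming only the last-message date is passed (the other keys in the answer above come from code the asker removed for readability):

    def parse(self, response):
        for next_forum in response.xpath('//span[@class="topictitle"]'):
            next_forum_url = response.urljoin(next_forum.xpath('.//a/@href').extract_first())
            last_message = next_forum.xpath('.//ancestor::td[1]/following-sibling::td[4]/span/text()').get()
            # each request carries its own copy instead of sharing one class variable
            yield scrapy.Request(url=next_forum_url, callback=self.parse_contents,
                                 cb_kwargs={'post_subject_last_message_date': last_message})

    def parse_contents(self, response, post_subject_last_message_date):
        # receives the value that was current when this request was created
        item = PostsItem()
        item['post_message'] = "..."  # extraction logic as in the original parse_contents
        item['post_subject_last_message_date'] = post_subject_last_message_date
        yield item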

Only items from first Beautiful Soup object are being added to my lists

I suspect this isn't very complicated, but I can't seem to figure it out. I'm using Selenium and Beautiful Soup to parse Petango.com. The data will be used to help a local shelter understand how it compares to other area shelters on different metrics, so the next step will be taking these data frames and doing some analysis.
I grab detail URLs from a different module and import the list here.
My issue is that my lists only show the value from the first dog's HTML. Stepping through, I noticed the len values differ across the soup iterations, so I realize my error is somewhere after that, but I can't figure out how to fix it.
Here is my code so far (running the whole process vs using a cached page)
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
from Petango import pet_links

headings = []
values = []
ShelterInfo = []
ShelterInfoWebsite = []
ShelterInfoEmail = []
ShelterInfoPhone = []
ShelterInfoAddress = []
Breed = []
Age = []
Color = []
SpayedNeutered = []
Size = []
Declawed = []
AdoptionDate = []

# to access sites, change url_list to pet_links (break out as needed) and change "if False" to True.
# False looks to the html file
url_list = (pet_links[4], pet_links[6], pet_links[8])
#url_list = ("Petango.html", "Petango.html", "Petango.html")

for link in url_list:
    page_source = None
    if True:
        # PetPage = link populates links from above; the hard-coded link was for one detail page,
        # and the html file was for the cached site
        PetPage = link
        #PetPage = 'https://www.petango.com/Adopt/Dog-Terrier-American-Pit-Bull-45569732'
        #PetPage = Petango.html
        PetDriver = webdriver.Chrome(executable_path='/Users/paulcarson/Downloads/chromedriver')
        PetDriver.implicitly_wait(30)
        PetDriver.get(link)
        page_source = PetDriver.page_source
        PetDriver.close()
    else:
        with open("Petango.html", 'r') as f:
            page_source = f.read()

    PetSoup = BeautifulSoup(page_source, 'html.parser')
    print(len(PetSoup.text))

    # get the details about the shelter and add to lists
    ShelterInfo.append(PetSoup.find('div', class_="DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find('h4').text)
    ShelterInfoParagraphs = PetSoup.find('div', class_="DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find_all('p')

    First_Paragraph = ShelterInfoParagraphs[0]
    if "Website" not in First_Paragraph.text:
        raise AssertionError("first paragraph is not about site")
    ShelterInfoWebsite.append(First_Paragraph.find('a').text)

    Second_Paragraph = ShelterInfoParagraphs[1]
    ShelterInfoEmail.append(Second_Paragraph.find('a')['href'])

    Third_Paragraph = ShelterInfoParagraphs[2]
    ShelterInfoPhone.append(Third_Paragraph.find('span').text)

    Fourth_Paragraph = ShelterInfoParagraphs[3]
    ShelterInfoAddress.append(Fourth_Paragraph.find('span').text)

    # get the details about the pet
    ul = PetSoup.find('div', class_='group details-list').ul  # Gets the ul tag
    li_items = ul.find_all('li')  # Finds all the li tags within the ul tag
    for li in li_items:
        heading = li.strong.text
        headings.append(heading)
        value = li.span.text
        if value:
            values.append(value)
        else:
            values.append(None)

    Breed.append(values[0])
    Age.append(values[1])
    print(Age)
    Color.append(values[2])
    SpayedNeutered.append(values[3])
    Size.append(values[4])
    Declawed.append(values[5])
    AdoptionDate.append(values[6])

ShelterDF = pd.DataFrame(
    {
        'Shelter': ShelterInfo,
        'Shelter Website': ShelterInfoWebsite,
        'Shelter Email': ShelterInfoEmail,
        'Shelter Phone Number': ShelterInfoPhone,
        'Shelter Address': ShelterInfoAddress
    })

PetDF = pd.DataFrame(
    {'Breed': Breed,
     'Age': Age,
     'Color': Color,
     'Spayed/Neutered': SpayedNeutered,
     'Size': Size,
     'Declawed': Declawed,
     'Adoption Date': AdoptionDate
    })

print(PetDF)
print(ShelterDF)
Output from print(len(...)) and from printing the Age list as the loop progresses:
12783
['6y 7m']
10687
['6y 7m', '6y 7m']
10705
['6y 7m', '6y 7m', '6y 7m']
Could someone please point me in the right direction?
Thank you for your help!
Paul
You need to change the find() method into find_all() in BeautifulSoup so it locates all the elements.

values is global, so it keeps growing across iterations, yet you only ever append the value at a fixed index to Age:
Age.append(values[1])
The same problem applies to your other global lists (a static index, whether 1 or 2, etc.).
You need a way to track the appropriate index to use, perhaps through a counter, or some other logic to ensure the current value is added, e.g. for the current Age, is it the second li in the loop? Or just append PetSoup.select_one("[data-bind='text: age']").text.
It looks like each item of interest, e.g. colour, spayed, carries a data-bind attribute, so you can use those with the appropriate attribute value to select each value and avoid looping over the li elements.
e.g. current_colour = PetSoup.select_one("[data-bind='text: color']").text
It is best to assign the result to a variable and test that it is not None before accessing .text.
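A minimal per-page sketch of that second suggestion; only the 'age' and 'color' data-bind values appear in the answer above, so any other attribute values would need to be confirmed against the page source:

    # inside the "for link in url_list:" loop, after PetSoup is built
    age_el = PetSoup.select_one("[data-bind='text: age']")
    Age.append(age_el.text if age_el is not None else None)

    color_el = PetSoup.select_one("[data-bind='text: color']")
    Color.append(color_el.text if color_el is not None else None)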

After adding iteration over page ids in Scrapy, responses in the parse method no longer run

I have a few print calls in my spider for debugging. In the start_requests method, I'm generating URLs by appending numbers in the range [0, 4] to a base URL, which then gets parsed by the parse_grant method. In that method, the first print gets called, but the second does not.
Still learning here, so I may have made a stupid mistake and don't quite understand what's happening with Twisted in the background.
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider, Rule
from scrapy.http import Request
from scraper_app.items import NSERCGrant
from scrapy.selector import Selector

class NSERC_Spider(Spider):
    name = 'NSERCSpider'
    allowed_domains = ["http://www.nserc-crsng.gc.ca"]
    # Maximum page id to use.
    max_id = 5

    def start_requests(self):
        for i in range(self.max_id):
            if i == 0:
                continue
            yield Request("http://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=%d" % i,
                          callback=self.parse_grant)

    def parse_grant(self, response):
        print("Being called")
        sel = Selector(response)
        grants = sel.xpath('.//html//body')
        items = []
        for response in grants:
            print("Responses being called")
            item = NSERCGrant()
            # Row one
            item['Competition_Year'] = response.xpath('.//tr[1]//td[2]//text()').extract()
            item['Fiscal_Year'] = response.xpath('.//tr[1]//td[4]//text()').extract()
            # Row two
            item['Project_Lead_Name'] = response.xpath('.//tr[2]//td[2]//text()').extract()
            item['Institution'] = response.xpath('.//tr[2]//td[4]//text()').extract()
            # Row three
            item['Department'] = response.xpath('.//tr[3]//td[2]//text()').extract()
            item['Province'] = response.xpath('.//tr[3]//td[4]//text()').extract()
            # Row four
            item['Award_Amount'] = response.xpath('.//tr[4]//td[2]//text()').extract()
            item['Installment'] = response.xpath('.//tr[4]//td[4]//text()').extract()
            # Row five
            item['Program'] = response.xpath('.//tr[5]//td[2]//text()').extract()
            item['Selection_Committee'] = response.xpath('.//tr[5]//td[4]//text()').extract()
            # Row six
            item['Research_Subject'] = response.xpath('.//tr[6]//td[2]//text()').extract()
            item['Area_of_Application'] = response.xpath('.//tr[6]//td[4]//text()').extract()
            # Row seven
            item['Co_Researchers'] = response.xpath(".//tr[7]//td[2]//text()").extract()
            item['Partners'] = response.xpath('.//tr[7]//td[4]//text()').extract()
            # Award Summary
            item['Award_Summary'] = response.xpath('.//p//text()').extract()
            items.append(item)
        return items
The information you are looking for only occurs once on each page, and the body tag is on every page, so the loop and the line
grants = sel.xpath('.//html//body')
are redundant. Also, response.xpath('... your xpath here ...') saves some code. Try this:
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.http import Request
from scraper_app.items import NSERCGrant

class NSERC_Spider(Spider):
    name = 'NSERCSpider'
    allowed_domains = ["http://www.nserc-crsng.gc.ca"]
    # Maximum page id to use.
    max_id = 5

    def start_requests(self):
        for i in range(1, self.max_id):
            yield Request("http://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=%d" % i,
                          callback=self.parse_grant)

    def parse_grant(self, response):
        print("Being called")
        item = NSERCGrant()
        # Row one
        item['Competition_Year'] = response.xpath('//tr[1]//td[2]//text()').extract()
        item['Fiscal_Year'] = response.xpath('//tr[1]//td[4]//text()').extract()
        # Row two
        item['Project_Lead_Name'] = response.xpath('//tr[2]//td[2]//text()').extract()
        item['Institution'] = response.xpath('//tr[2]//td[4]//text()').extract()
        # Row three
        item['Department'] = response.xpath('//tr[3]//td[2]//text()').extract()
        item['Province'] = response.xpath('//tr[3]//td[4]//text()').extract()
        # Row four
        item['Award_Amount'] = response.xpath('//tr[4]//td[2]//text()').extract()
        item['Installment'] = response.xpath('//tr[4]//td[4]//text()').extract()
        # Row five
        item['Program'] = response.xpath('//tr[5]//td[2]//text()').extract()
        item['Selection_Committee'] = response.xpath('//tr[5]//td[4]//text()').extract()
        # Row six
        item['Research_Subject'] = response.xpath('//tr[6]//td[2]//text()').extract()
        item['Area_of_Application'] = response.xpath('//tr[6]//td[4]//text()').extract()
        # Row seven
        item['Co_Researchers'] = response.xpath("//tr[7]//td[2]//text()").extract()
        item['Partners'] = response.xpath('//tr[7]//td[4]//text()').extract()
        # Award Summary
        item['Award_Summary'] = response.xpath('//p//text()').extract()
        yield item
I've also tweaked your start_requests routine to remove the if i == 0 check.
Take a look at scrapy shell which allows you to try out your xpaths and see the results interactively.
When I try
grants = sel.xpath('.//html//body')
from my scrapy shell, this is what I get
In [10]: grants = sel.xpath('.//html//body')
In [11]: grants
Out[11]: []
When I change it to the following code,
In [12]: grants = sel.xpath('/html/body')
In [13]: grants
Out[13]: [<Selector xpath='/html/body' data=u'<body>\r\n<div id="cn-body-inner-1col">\r\n<'>]

BeautifulSoup returning unrelated HTML

I'm trying to parse basketball stat data from pages like http://www.sports-reference.com/cbb/boxscores/2014-11-14-kentucky.html. I'm using Python 2.7.6 and BeautifulSoup 4-4.3.2. I'm searching gamelogs like the above page for the class "sortable" in order to get access to the raw stat data contained within the tables. I am only interested in the "Basic Stats" for each team.
However, the HTML that BeautifulSoup is returning is not at all what I expect. Instead I get a list of all-time team records and data for every school that has ever played. I don't have enough reputation to post a second link here of the output or I would.
Basically, there are four class "sortable" tables on the boxscore page. When I ask BS to find them by the only way I can think of to distinguish them from the other data, it instead returns completely irrelevant data and I can't even figure out where the returned data comes from.
Here's the code:
import urllib2
import re
import sys
from bs4 import BeautifulSoup

class Gamelogs():
    def __init__(self):
        # the base page that has all boxscore links
        self.teamPageSoup = BeautifulSoup(urllib2.urlopen(
            'http://www.sports-reference.com/cbb/schools/' + school +
            '/2015-gamelogs.html'))
        # use regex to only find links with score data
        self.statusPageLinks = self.teamPageSoup.findAll(href=re.compile(
            "boxscores"))

def scoredata(links, school):
    # for each link in the school's season
    for l in links:
        gameSoup = BeautifulSoup(urllib2.urlopen(l))
        # remove extra link formatting to get just the filename alone
        l = l[59 + len(school):]
        # open a local file with that filename to store the results
        fo = open(str(l), "w")
        # create a list that will hold the box score data only
        output = gameSoup.findAll(class_="sortable")
        # write it line by line to the file that was just opened
        for o in output:
            fo.write(str(o) + '\n')
        fo.close()

def getlinks(school):
    gamelogs = Gamelogs()
    # open a new file to store the output
    fo = open(school + '.txt', "w")
    # remove extraneous links
    gamelogs.statusPageLinks = gamelogs.statusPageLinks[2:]
    # create the list that will hold each school's season-long boxscores
    boxlinks = list()
    for s in gamelogs.statusPageLinks:
        # make the list element a string so it can be sliced
        string = str(s)
        # remove extra link formatting
        string = string[9:]
        string = string[:-16]
        # create the full list of games per school
        boxlinks.insert(0, 'http://www.sports-reference.com/cbb/schools/'
                        + school + string)
    scoredata(boxlinks, school)

if __name__ == '__main__':
    # for each school given as a command-line argument
    for arg in sys.argv[1:]:
        school = arg
        getlinks(school)
Is this a problem with BS, my code, or the site?
It looks like this is an issue with your code. The page you are getting back sounds like this one: http://www.sports-reference.com/cbb/schools/?redir
Whenever I enter an invalid school name, I am redirected to a page showing stats for 477 different teams. FYI: team names in the URL are also case-sensitive.
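One way to catch that silent redirect early is to check the final URL that urllib2 reports after opening the page. A small sketch of a check that could go in Gamelogs.__init__ (this check is an addition, not part of the answer above):

    resp = urllib2.urlopen('http://www.sports-reference.com/cbb/schools/' + school +
                           '/2015-gamelogs.html')
    # geturl() returns the URL actually served after any redirect; a mismatch usually
    # means the school name was misspelled or mis-cased
    if '/schools/' + school + '/' not in resp.geturl():
        raise ValueError("redirected to %s - check the school name" % resp.geturl())
    self.teamPageSoup = BeautifulSoup(resp)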

How do I request callback on a URL that I first scraped to get?

Just started toying around with Scrapy a bit to help scrape some fantasy basketball stats. My main problem is in my spider - how do I scrape the href attribute of a link and then call back another parser on that URL?
I looked into link extractors, and I think this might be my solution but I'm not sure. I've re-read it over and over again, and still am confused on where to start. The following is the code I have so far.
def parse_player(self, response):
    player_name = "Steven Adams"
    sel = Selector(response)
    player_url = sel.xpath("//a[text()='%s']/@href" % player_name).extract()
    return Request("http://sports.yahoo.com/'%s'" % player_url, callback=self.parse_curr_stats)

def parse_curr_stats(self, response):
    sel = Selector(response)
    stats = sel.xpath("//div[@id='mediasportsplayercareerstats']//table[@summary='Player']/tbody/tr[last()-1]")
    items = []
    for stat in stats:
        item = player_item()
        item['fgper'] = stat.xpath("td[@title='Field Goal Percentage']/text()").extract()
        item['ftper'] = stat.xpath("td[@title='Free Throw Percentage']/text()").extract()
        item['treys'] = stat.xpath("td[@title='3-point Shots Made']/text()").extract()
        item['pts'] = stat.xpath("td[@title='Points']/text()").extract()
        item['reb'] = stat.xpath("td[@title='Total Rebounds']/text()").extract()
        item['ast'] = stat.xpath("td[@title='Assists']/text()").extract()
        item['stl'] = stat.xpath("td[@title='Steals']/text()").extract()
        item['blk'] = stat.xpath("td[@title='Blocked Shots']/text()").extract()
        item['tov'] = stat.xpath("td[@title='Turnovers']/text()").extract()
        item['fga'] = stat.xpath("td[@title='Field Goals Attempted']/text()").extract()
        item['fgm'] = stat.xpath("td[@title='Field Goals Made']/text()").extract()
        item['fta'] = stat.xpath("td[@title='Free Throws Attempted']/text()").extract()
        item['ftm'] = stat.xpath("td[@title='Free Throws Made']/text()").extract()
        items.append(item)
    return items
So as you can see, in the first parse function, you're given a name, and you look for the link on the page that will guide you to their individual page, which is stored in "player_url". How do I then go to that page and run the 2nd parser on it?
I feel as if I am completely glossing over something and if someone could shed some light it would be greatly appreciated!
When you want to send a Request object, just use yield rather than return, like this:
def parse_player(self, response):
    ......
    yield Request(......)
If there are many Request objects that you want to send in a single parse method, a best practice is like this:
def parse_player(self, response):
    ......
    res_objs = []
    # then add every Request object into the 'res_objs' list,
    # and at the end of the method, do the following:
    for req in res_objs:
        yield req
I think when the Scrapy spider is running, it will handle requests under the hood like this:
# handle requests
for req_obj in self.parse_player(response):
    # do something with the Request object
So just remember to use yield to send Request objects.
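Putting the two pieces together for the original question, a minimal sketch (it assumes the extracted href is relative; extract_first() and response.urljoin() are not in the asker's code, and on very old Scrapy versions urlparse.urljoin(response.url, player_url) can stand in for response.urljoin):

def parse_player(self, response):
    player_name = "Steven Adams"
    # extract_first() returns a single string (or None) instead of a list
    player_url = response.xpath("//a[text()='%s']/@href" % player_name).extract_first()
    if player_url:
        # urljoin() resolves the relative href against the current page; the yielded
        # Request is scheduled and parse_curr_stats receives its response
        yield Request(response.urljoin(player_url), callback=self.parse_curr_stats)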
