Scrapy (Python): extract URLs from a spider class - python

I am new to Scrapy.
I would like to collect all the URLs the spider crawls. For instance, I have this code:
# I did something like this, but it did not work
myURLs = []

class CoolSpider(CrawlSpider):
    name = 'cool'
    allowed_domains = ['phooky.com']
    start_urls = ['https://www.phooky.com/']
    rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)

    def parse_obj(self, response):
        item = response.url
        myURLs.append(item)
        print(item)
Then, at the end, when I print(myURLs), nothing shows up.
Of course, I run this from the command line.
Thank you all.
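For what it's worth, here is a minimal sketch of one common way to do this (assuming the goal is simply to collect every crawled URL): keep the list on the spider instance and/or yield each URL as an item, then read the list once the crawl has finished. The attribute and item key names are just illustrative.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CoolSpider(CrawlSpider):
    name = 'cool'
    allowed_domains = ['phooky.com']
    start_urls = ['https://www.phooky.com/']
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.my_urls = []  # collect on the spider instance instead of a module-level list

    def parse_obj(self, response):
        self.my_urls.append(response.url)
        yield {'url': response.url}  # yielding items lets "scrapy crawl cool -o urls.csv" export them

    def closed(self, reason):
        # called once the crawl has finished, so the list is complete here
        print(self.my_urls)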

Related

Is it possible to scrape this sprite SVG?

Can I scrape this with standard Scrapy or do I need to use Selenium?
The HTML is:
<td class="example"><sprite-svg name="EXAMPLE2"><svg><use xlink:href="/spritemap/1_0_30#sprite-EXAMPLE2"></use></svg></sprite-svg></td>
I need the value "EXAMPLE2" somehow.
The XPath that works in the browser is //td[@class='example']//*[local-name() = 'svg']
When I put it into Scrapy I use the following code, but I am getting an XPath error.
'example' : div.xpath(".//td[@class='example']//*[local-name() = 'svg']()").extract()
Any ideas how to scrape it?
Looking at the table, each SVG sprite is under a class 'rug_X'. Something like:
import scrapy

class RaceSpider(scrapy.Spider):
    name = 'race'
    allowed_domains = ['thedogs.com.au']
    start_urls = ['https://www.thedogs.com.au/racing/gawler/2020-07-07/1/the-bunyip-maiden-stake-pr2-division1']

    def parse(self, response):
        row = response.xpath('//tbody/tr')
        for a in row:
            dog = a.xpath('.//td[@class="table__cell--tight race-runners__name"]/div/a/text()').get()
            number = a.xpath('.//td[@class="table__cell--tight race-runners__box"]/sprite-svg/@name').get()
            cleaned_num = int(number.replace('rug_', ''))
            grade = a.xpath('.//td[@class="race-runners__grade"]/text()').get()
            item = {'grade': grade, 'greyhound': dog, 'rug': cleaned_num}
            yield item
You could also use item loaders with a custom function to clean up the response you get.
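For example, here is a rough sketch of the item-loader idea (the item, loader, and field names are just illustrative, and on older Scrapy versions the processors live in scrapy.loader.processors instead of itemloaders.processors):

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

def clean_rug(value):
    # turn a value like "rug_3" into the integer 3
    return int(value.replace('rug_', ''))

class RunnerItem(scrapy.Item):
    greyhound = scrapy.Field()
    grade = scrapy.Field()
    rug = scrapy.Field(input_processor=MapCompose(clean_rug))

class RunnerLoader(ItemLoader):
    default_item_class = RunnerItem
    default_output_processor = TakeFirst()

# inside parse(), for each table row selector:
#     loader = RunnerLoader(selector=row)
#     loader.add_xpath('greyhound', './/td[@class="table__cell--tight race-runners__name"]/div/a/text()')
#     loader.add_xpath('rug', './/td[@class="table__cell--tight race-runners__box"]/sprite-svg/@name')
#     loader.add_xpath('grade', './/td[@class="race-runners__grade"]/text()')
#     yield loader.load_item()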
Yes, you can do it with Scrapy:
response.xpath("//td[@class='table__cell--tight race-runners__box']/sprite-svg/@name").getall()
Working Scrapy code:
import scrapy

class Test(scrapy.Spider):
    name = 'Test'
    start_urls = [
        'https://www.thedogs.com.au/racing/gawler/2020-07-07/1/the-bunyip-maiden-stake-pr2-division1']

    def parse(self, response):
        return {"nameList": response.xpath("//td[@class='table__cell--tight race-runners__box']/sprite-svg/@name").getall()}

Scrapy spider crawling but not exporting

I have a Scrapy spider that runs in the shell, but when I try to export to CSV, it returns an empty file. It exports data when I do not follow a link and parse the description, but once I add the extra method that parses the contents, it fails to work. Here is the code:
class MonsterSpider(CrawlSpider):
    name = "monster"
    allowed_domains = ["jobs.monster.com"]
    base_url = "http://jobs.monster.com/v-technology.aspx?"
    start_urls = [
        "http://jobs.monster.com/v-technology.aspx"
    ]
    for i in range(1, 5):
        start_urls.append(base_url + "page=" + str(i))

    rules = (Rule(SgmlLinkExtractor(allow=("jobs.monster.com",)),
                  callback='parse_items'),)

    def parse_items(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="col-xs-12"]')
        #items = []
        for site in sites.xpath('.//article[@class="js_result_row"]'):
            item = MonsterItem()
            item['title'] = site.xpath('.//span[@itemprop = "title"]/text()').extract()
            item['company'] = site.xpath('.//span[@itemprop = "name"]/text()').extract()
            item['city'] = site.xpath('.//span[@itemprop = "addressLocality"]/text()').extract()
            item['state'] = site.xpath('.//span[@itemprop = "addressRegion"]/text()').extract()
            item['link'] = site.xpath('.//a[@data-m_impr_a_placement_id= "jsr"]/@href').extract()
            follow = ''.join(item["link"])
            request = Request(follow, callback=self.parse_dir_contents)
            request.meta["item"] = item
            yield request
            #items.append(item)
        #return items

    def parse_dir_contents(self, response):
        item = response.meta["item"]
        item['desc'] = site.xpath('.//div[@itemprop = "description"]/text()').extract()
        return item
Taking out parse_dir_contents and uncommenting the "items" list and "append" lines gives the original code, which worked.
Well, as @tayfun suggests, you should use response.xpath or define the site variable.
By the way, you do not need sel = Selector(response). Responses come with an xpath method; there is no need to wrap them in another selector.
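For instance, the same selectors without the extra wrapper would look roughly like this:

def parse_items(self, response):
    # the response object supports xpath directly
    sites = response.xpath('//div[@class="col-xs-12"]')
    for site in sites.xpath('.//article[@class="js_result_row"]'):
        ...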
However, the main problem is that you restrict the domain of the spider. You define allowed_domains = ["jobs.monster.com"], but if you look at the URLs followed by your custom Request, you can see that they are something like http://jobview.monster.com/ or http://job-openings.monster.com. In this case your parse_dir_contents is not executed (the domain is not allowed), your item does not get returned, and you won't get any results.
Change allowed_domains = ["jobs.monster.com"] to
allowed_domains = ["monster.com"]
and your spider will work and return items.
You have an error in your parse_dir_contents method:
def parse_dir_contents(self, response):
    item = response.meta["item"]
    item['desc'] = response.xpath('.//div[@itemprop="description"]/text()').extract()
    return item
Note the use of response. I don't know where the site variable you are currently using came from.
Also, try to provide the error details when you post a question. Writing "it fails to work" doesn't say much.

Python: Scrapy start_urls list able to handle .format()?

I want to parse a list of stocks, so I am trying to format the end of my start_urls list so that I can just add the symbol instead of the entire URL.
Spider class with start_urls inside stock_list method:
class MySpider(BaseSpider):
    symbols = ["SCMP"]
    name = "dozen"
    allowed_domains = ["yahoo.com"]

    def stock_list(stock):
        start_urls = []
        for symb in symbols:
            start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
        return start_urls

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        revenue = hxs.select('//td[@align="right"]')
        items = []
        for rev in revenue:
            item = DozenItem()
            item["Revenue"] = rev.xpath("./strong/text()").extract()
            items.append(item)
        return items[0:3]
It all runs correctly if I get rid of stock_list and just use simple start_urls as normal, but as it currently stands it will not export more than an empty file.
Also, should I possibly try a sys.argv setup so that I can just type the stock symbol as an argument at the command line when I run $ scrapy crawl dozen -o items.csv?
Typically the shell prints 2015-04-25 14:50:57-0400 [dozen] DEBUG: Crawled (200) <GET http://finance.yahoo.com/q/is?s=SCMP+Income+Statement&annual> among the LOG/DEBUG output; however, it currently does not, implying that the start_urls are not being formatted correctly.
The proper way to implement dynamic start URLs is to use start_requests().
Using start_urls is the preferred practice when you have a static list of starting URLs.
start_requests(): This method must return an iterable with the first Requests to crawl for this spider.
Example:
class MySpider(BaseSpider):
    name = "dozen"
    allowed_domains = ["yahoo.com"]
    stock = ["SCMP", "APPL", "GOOG"]

    def start_requests(self):
        BASE_URL = "http://finance.yahoo.com/q/is?s={}"
        for s in self.stock:
            yield scrapy.Request(url=BASE_URL.format(s))

    def parse(self, response):
        # parse the responses here
        pass
This way you also use a generator instead of a pre-generated list, which scales better for a large stock list.
I would use a for loop, like this:
class MySpider(BaseSpider):
    stock = ["SCMP", "APPL", "GOOG"]
    name = "dozen"
    allowed_domains = ["yahoo.com"]

    def stock_list(stock):
        start_urls = []
        for i in stock:
            start_urls.append("http://finance.yahoo.com/q/is?s={}".format(i))
        return start_urls

    start_urls = stock_list(stock)
Then assign the function call as I have at the bottom.
UPDATE
Using Scrapy 0.24
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector

class MySpider(scrapy.Spider):
    symbols = ["SCMP"]
    name = "yahoo"
    allowed_domains = ["yahoo.com"]

    def stock_list(symbols):
        start_urls = []
        for symb in symbols:
            start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
        return start_urls

    start_urls = stock_list(symbols)

    def parse(self, response):
        revenue = Selector(response=response).xpath('//td[@align="right"]').extract()
        print(revenue)
You may want to tweak the xpath to get exactly what you want; it seems to be pulling back a fair amount of stuff. But I've tested this and the scraping is working as expected.
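On the sys.argv idea in the question: Scrapy has built-in spider arguments that are passed with -a and arrive as constructor keyword arguments, which avoids touching sys.argv entirely. A rough sketch (the symbols argument name is just an example):

import scrapy

class MySpider(scrapy.Spider):
    name = "dozen"
    allowed_domains = ["yahoo.com"]

    def __init__(self, symbols="SCMP", *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # -a symbols=SCMP,AAPL arrives here as a plain string
        self.symbols = symbols.split(",")

    def start_requests(self):
        for s in self.symbols:
            yield scrapy.Request("http://finance.yahoo.com/q/is?s={}&annual".format(s))

    def parse(self, response):
        pass

# run it with:
# scrapy crawl dozen -a symbols=SCMP,AAPL -o items.csv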

How to crawl only two pre-defined pages, but they scrape different items?

I'm trying to scrape a site driven by some user input. For example, the user gives me the pid of a product and a name, and a separate program will launch the spider, gather the data, and return it to the user.
However, the only information I want is the product and the person, which are found at two XML links. If I know these two links and the pattern, how do I build the callback to parse the different items?
For example, if I have these two Items defined:
class PersonItem(Item):
    name = Field()
    ...

class ProductItem(Item):
    pid = Field()
    ...
And I know their links follow the pattern:
www.example.com/person/*<name_of_person>*/person.xml
www.example.com/*<product_pid>*/product.xml
Then my spider would look something like this:
class MySpider(BaseSpider):
    name = "myspider"

    # simulated user input
    pid = "4545-fw"
    person = "bob"

    allowed_domains = ["http://www.example.com"]
    start_urls = ['http://www.example.com/person/%s/person.xml' % person,
                  'http://www.example.com/%s/product.xml' % pid]

    def parse(self, response):
        # Not sure here if scraping person or item
I know that I can also define rules using Rule(SgmlLinkExtractor()) and give the person and product each their own parse callback. However, I'm not sure how they apply here, since I think rules are meant for crawling deeper, whereas I only need to scrape the surface level.
If you want to be retro-active you could put your logic in parse():
def parse(self, response):
    if 'person.xml' in response.url:
        item = PersonItem()
    elif 'product.xml' in response.url:
        item = ProductItem()
    else:
        raise Exception('Could not determine item type')
UPDATE:
If you want to be pro-active you could override start_requests():
class MySpider(BaseSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    pid = "4545-fw"
    person = "bob"

    def start_requests(self):
        start_urls = (
            ('http://www.example.com/person/%s/person.xml' % self.person, PersonItem),
            ('http://www.example.com/%s/product.xml' % self.pid, ProductItem),
        )
        for url, cls in start_urls:
            yield Request(url, meta=dict(cls=cls))

    def parse(self, response):
        item = response.meta['cls']()
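        # whichever item class was stashed in meta gets instantiated here;
        # fill in its fields from the response and yield/return the item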

scrapy: newbie attempting to debug code

Total newbie here, trying to get Scrapy to read a list of URLs from a CSV and return the items in a CSV.
Need some help to figure out where I'm going wrong here:
Spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import random

class incyspider(BaseSpider):
    name = "incyspider"

    def __init__(self):
        super(incyspider, self).__init__()
        domain_name = "incyspider.co.uk"
        f = open("urls.csv")
        start_urls = [url.strip() for url in f.readlines()]
        f.close

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="Product"]')
        items = []
        for site in sites:
            item['title'] = hxs.select('//div[@class="Name"]/node()').extract()
            item['hlink'] = hxs.select('//div[@class="Price"]/node()').extract()
            item['price'] = hxs.select('//div[@class="Codes"]/node()').extract()
            items.append(item)
        return items

SPIDER = incyspider()
Here's the items.py code:
from scrapy.item import Item, Field

class incyspider(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    hlink = Field()
    price = Field()
    pass
To run, I'm using
scrapy crawl incyspider -o items.csv -t csv
I would seriously appreciate any pointers.
I'm not exactly sure, but after a quick look at your code I would say that at least you need to replace this line
sites = hxs.select('//div[@class="Product"]')
with this line
sites = hxs.select('//div[@class="Product"]').extract()
As a first punt at answering this, your spider code is missing an import for your incyspider item class. Also you're not creating an instance of any kind of item to store the title/hlink/price info, so the items.append(item) line might complain.
Since your spider is also called incyspider, you should rename the item to something like incyspiderItem and then add the corresponding import to your spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import random
from incyspider.items import incyspiderItem

class incyspider(BaseSpider):
    name = "incyspider"

    def __init__(self):
        super(incyspider, self).__init__()
        domain_name = "incyspider.co.uk"
        f = open("urls.csv")
        start_urls = [url.strip() for url in f.readlines()]
        f.close()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="Product"]')
        items = []
        for site in sites:
            item = incyspiderItem()
            item['title'] = hxs.select('//div[@class="Name"]/node()').extract()
            item['hlink'] = hxs.select('//div[@class="Price"]/node()').extract()
            item['price'] = hxs.select('//div[@class="Codes"]/node()').extract()
            items.append(item)
        return items
If I'm wrong, then please edit the question to explain how you know there is a problem with the code eg: is the expected output different to the actual output? If so, how?
