Python: Scrapy start_urls list able to handle .format()?

I want to parse a list of stocks so I am trying to format the end of my start_urls list so I can just add the symbol instead of the entire url.
Spider class with start_urls inside stock_list method:
class MySpider(BaseSpider):
    symbols = ["SCMP"]
    name = "dozen"
    allowed_domains = ["yahoo.com"]
    def stock_list(stock):
        start_urls = []
        for symb in symbols:
            start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
        return start_urls
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        revenue = hxs.select('//td[@align="right"]')
        items = []
        for rev in revenue:
            item = DozenItem()
            item["Revenue"] = rev.xpath("./strong/text()").extract()
            items.append(item)
        return items[0:3]
Everything runs correctly if I remove stock_list and define start_urls the usual way, but as written the spider exports nothing but an empty file.
Also, should I use a sys.argv setup so that I can pass the stock symbol as a command-line argument when I run $ scrapy crawl dozen -o items.csv?
Normally the log output includes a line such as 2015-04-25 14:50:57-0400 [dozen] DEBUG: Crawled (200) <GET http://finance.yahoo.com/q/is?s=SCMP+Income+Statement&annual>. That line is currently missing, which suggests start_urls is not being populated correctly.

The proper way to implement dynamic start URLs is to override start_requests().
Using start_urls is the preferred practice when you have a static list of starting URLs.
From the documentation for start_requests(): "This method must return an iterable with the first Requests to crawl for this spider."
Example:
import scrapy
class MySpider(scrapy.Spider):
    name = "dozen"
    allowed_domains = ["yahoo.com"]
    stock = ["SCMP", "APPL", "GOOG"]
    def start_requests(self):
        BASE_URL = "http://finance.yahoo.com/q/is?s={}"
        # yield one request per symbol; responses go to parse() by default
        for s in self.stock:
            yield scrapy.Request(url=BASE_URL.format(s))
    def parse(self, response):
        # parse the responses here
        pass
This way you also use a generator instead of a pre-built list, which scales better when the list of stocks is large.
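To show the generator point in isolation, here is a minimal sketch that uses only the standard library, so it runs without Scrapy installed. The names URL_TEMPLATE and iter_stock_urls are hypothetical; the URL pattern is the one from the question.

```python
from urllib.parse import quote

# URL pattern taken from the question; {} is replaced by the ticker symbol.
URL_TEMPLATE = "http://finance.yahoo.com/q/is?s={}&annual"


def iter_stock_urls(symbols):
    """Lazily yield one URL per symbol instead of building a full list."""
    for symbol in symbols:
        # quote() percent-encodes characters that are not URL-safe
        yield URL_TEMPLATE.format(quote(symbol))


urls = iter_stock_urls(["SCMP", "GOOG"])
first = next(urls)  # only the first URL has been built at this point
```

Inside start_requests(), each yielded URL could be wrapped in scrapy.Request(url=...) in the same loop.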

I would use a for loop, like this:
class MySpider(BaseSpider):
    stock = ["SCMP", "APPL", "GOOG"]
    name = "dozen"
    allowed_domains = ["yahoo.com"]
    def stock_list(stock):
        start_urls = []
        for i in stock:
            start_urls.append("http://finance.yahoo.com/q/is?s={}".format(i))
        return start_urls
    start_urls = stock_list(stock)
Then assign the result of the function call to start_urls, as on the last line of the class.
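Why this works: a class body runs top to bottom when the class is defined, so a function defined earlier in the body can be called later in the same body. Defining the function alone, as in the question, never creates a start_urls attribute. A minimal sketch, with plain classes standing in for the spider (the class names Original and Fixed are made up for illustration):

```python
URL = "http://finance.yahoo.com/q/is?s={}&annual"


class Original:
    symbols = ["SCMP"]

    def stock_list(stock):
        # Defined but never called, so this class never gets
        # a start_urls attribute of its own.
        return [URL.format(s) for s in stock]


class Fixed:
    symbols = ["SCMP"]

    def stock_list(stock):
        return [URL.format(s) for s in stock]

    # Runs while the class is being defined: the helper above is called
    # with the class-level list and its result becomes a class attribute.
    start_urls = stock_list(symbols)
```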
UPDATE
Using Scrapy 0.24
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
class MySpider(scrapy.Spider):
    symbols = ["SCMP"]
    name = "yahoo"
    allowed_domains = ["yahoo.com"]
    def stock_list(symbols):
        start_urls = []
        for symb in symbols:
            start_urls.append("http://finance.yahoo.com/q/is?s={}&annual".format(symb))
        return start_urls
    start_urls = stock_list(symbols)
    def parse(self, response):
        revenue = Selector(response=response).xpath('//td[@align="right"]').extract()
        print(revenue)
You may want to tweak the xpath to get exactly what you want; it seems to be pulling back a fair amount of stuff. But I've tested this and the scraping is working as expected.

Related

scrapy python take out URLs from class name

I am new to Scrapy.
I would like to get all the URLs out of the class; for instance, I have this code:
# I did something like this, but it did not work
myURLs = []
class CoolSpider(CrawlSpider):
    name = 'cool'
    allowed_domains = ['phooky.com']
    start_urls = ['https://www.phooky.com/']
    rules = (Rule(LinkExtractor(), callback="parse_obj", follow=True),)
    def parse_obj(self, response):
        item = response.url
        myURLs.append(item)
        print(item)
Finally, when I call print(myURLs), nothing is shown.
Of course, I run this from the command line.
Thank you all.

Scrapy returns all value in a single cell

I'm trying to scrape this site using Scrapy, but it returns all the values in a single cell; I expect each value in a separate row.
example:
milage: 25
milage: 377
milage: 247433
milage: 464130
But I'm getting the data like this:
example:
milage:[u'25',
u'377',
u'247433',
u'399109',
u'464130',
u'399631',
u'435238',
u'285000',
u'287470',
u'280000']
Here is my code:
import scrapy
from ..items import ExampleItem
from scrapy.selector import HtmlXPathSelector
url = 'https://example.com'
class Example(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = [url]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item_selector = hxs.select('//div[@class="listing_format card5 relative"]')
        for fields in item_selector:
            item = ExampleItem()
            item['Mileage'] = fields.select('//li[strong="Mileage"]/span/text()').extract()
            yield item

You didn't show the site, but maybe you need a relative XPath:
item['Mileage'] = fields.select('.//li[strong="Mileage"]/span/text()').extract_first()
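The reason is that in Scrapy an XPath starting with // searches from the document root even when it is called on a sub-selector, while .// searches below the current node only. The sketch below illustrates the effect with the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors; the markup is made up, and ElementTree supports only a subset of XPath, so treat it as an analogy rather than the Scrapy API.

```python
import xml.etree.ElementTree as ET

# Made-up markup: two listing cards, each with one Mileage entry.
PAGE = """
<div>
  <div class="card"><ul><li><strong>Mileage</strong><span>25</span></li></ul></div>
  <div class="card"><ul><li><strong>Mileage</strong><span>377</span></li></ul></div>
</div>
"""

root = ET.fromstring(PAGE)
cards = root.findall(".//div[@class='card']")
MILEAGE = ".//li[strong='Mileage']/span"

# Searching the whole document inside the loop: every item gets every value.
whole_document = [[s.text for s in root.findall(MILEAGE)] for _ in cards]

# Searching relative to each card: every item gets only its own value.
per_card = [card.find(MILEAGE).text for card in cards]
```

Here whole_document repeats the full list for each card, which matches the "all values in a single cell" symptom, while per_card holds one value per card.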

It sounds like you need to iterate over your mileages.
for fields in item_selector:
    milages = fields.select('//li[strong="Mileage"]/span/text()').extract()
    # yield one item per extracted value instead of one item holding the whole list
    for milage in milages:
        item = ExampleItem()
        item['Mileage'] = milage
        yield item
Also consider making the XPath in fields.select('//li[strong="Mileage"]/span/text()') more specific.

Pass variable to test.py in spider folder using scrapy

I'm using Scrapy. The following is the code for test.py in spider folder.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://seattle.craigslist.org/npo/"]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items
Essentially, I want to iterate over my URL list and pass each URL into the MySpider class as start_urls. Could anyone suggest how to do this?

Instead of having "statically defined" start_urls, you need to override the start_requests() method:
from scrapy.spider import BaseSpider
from scrapy.http import Request
class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    def start_requests(self):
        list_of_urls = [...] # reading urls from a text file, for example
        for url in list_of_urls:
            yield Request(url)
    def parse(self, response):
        ...
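For the "reading urls from a text file" part, a minimal standard-library sketch could look like this. The helper name clean_urls, the '#' comment convention, and the second sample URL are assumptions made for illustration; they are not part of Scrapy or of the question.

```python
def clean_urls(lines):
    """Yield non-empty, whitespace-stripped URLs from an iterable of lines."""
    for line in lines:
        url = line.strip()
        # Skip blank lines and lines commented out with '#'
        if url and not url.startswith("#"):
            yield url


# Made-up sample standing in for the lines of a text file.
sample = [
    "http://seattle.craigslist.org/npo/\n",
    "\n",
    "# disabled\n",
    "  http://portland.craigslist.org/npo/  \n",
]
urls = list(clean_urls(sample))
```

Because an open file is itself an iterable of lines, start_requests() could open the file in a with block, pass the file object to clean_urls(), and yield Request(url) for each result.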

Pass input file to scrapy containing a list of domains to be scraped

I saw this question: Pass Scrapy Spider a list of URLs to crawl via .txt file.
That approach changes the list of start URLs. I want to scrape the web pages for each domain (read from a file) and put the results into a separate file (named after the domain).
I have scraped data for one website, but I specified the start URL and allowed_domains in the spider itself. How can I set these from an input file?
Update 1:
This is the code that I tried:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
class AppleItem(Item):
    reference_link = Field()
    rss_link = Field()
class AppleSpider(CrawlSpider):
    name = 'apple'
    allowed_domains = []
    start_urls = []
    def __init__(self):
        for line in open('./domains.txt', 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://%s' % line)
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]
    def parse_item(self, response):
        sel = HtmlXPathSelector(response)
        rsslinks = sel.select('//a[contains(@href, "pdf")]/@href').extract()
        items = []
        for rss in rsslinks:
            item = AppleItem()
            item['reference_link'] = response.url
            item['rss_link'] = rsslinks
            items.append(item)
        filename = response.url.split("/")[-2]
        open(filename+'.csv', 'wb').write(items)
I get an error when I run this: AttributeError: 'AppleSpider' object has no attribute '_rules'

You can use the __init__ method of the spider class to read the file and overwrite start_urls and allowed_domains.
Suppose we have a file domains.txt with this content:
example1.com
example2.com
...
Example:
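class MySpider(BaseSpider):
    name = "myspider"
    def __init__(self, *args, **kwargs):
        # call the parent constructor; skipping it leads to errors such as
        # the missing '_rules' attribute when subclassing CrawlSpider
        super(MySpider, self).__init__(*args, **kwargs)
        self.allowed_domains = []
        self.start_urls = []
        with open('./domains.txt', 'r') as f:
            for line in f:
                domain = line.strip()  # drop the trailing newline
                if domain:
                    self.allowed_domains.append(domain)
                    self.start_urls.append('http://%s' % domain)
    def parse(self, response):
        # extract your data from the page here,
        # then put it into a single file per domain
        # (filename logic from the Scrapy tutorial http://doc.scrapy.org/en/latest/intro/tutorial.html)
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(your_data)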
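Note that response.url.split("/")[-2] returns the domain only for a root URL such as http://example1.com/; for http://example2.com/a/b/c.html it returns "b". Since the goal is one file per domain, a sketch based on urllib.parse may be more robust. The function name output_filename and the .csv default are assumptions for illustration.

```python
from urllib.parse import urlparse


def output_filename(url, extension=".csv"):
    """Build a per-domain filename such as 'example1.com.csv' from a URL."""
    # .hostname is the lower-cased host without any port number
    host = urlparse(url).hostname
    return host + extension


root_page = output_filename("http://example1.com/")
deep_page = output_filename("http://example2.com:8080/a/b/c.html")
```

In parse(), output_filename(response.url) would then give the same filename for every page crawled from the same domain.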

scrapy: newbie attempting to debug code

Total newbie, trying to get scrapy to read a list of urls from csv and return the items in a csv.
Need some help to figure out where I'm going wrong here:
Spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import random
class incyspider(BaseSpider):
    name = "incyspider"
    def __init__(self):
        super(incyspider, self).__init__()
        domain_name = "incyspider.co.uk"
        f = open("urls.csv")
        start_urls = [url.strip() for url in f.readlines()]
        f.close
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="Product"]')
        items = []
        for site in sites:
            item['title'] = hxs.select('//div[@class="Name"]/node()').extract()
            item['hlink'] = hxs.select('//div[@class="Price"]/node()').extract()
            item['price'] = hxs.select('//div[@class="Codes"]/node()').extract()
            items.append(item)
        return items
SPIDER = incyspider()
Here's the items.py code:
from scrapy.item import Item, Field
class incyspider(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    hlink = Field()
    price = Field()
    pass
To run, I'm using
scrapy crawl incyspider -o items.csv -t csv
I would seriously appreciate any pointers.

I'm not exactly sure but after a quick look at your code I would say that at least you need to replace this line
sites = hxs.select('//div[@class="Product"]')
by this line
sites = hxs.select('//div[@class="Product"]').extract()

As a first punt at answering this, your spider code is missing an import for your incyspider item class. Also you're not creating an instance of any kind of item to store the title/hlink/price info, so the items.append(item) line might complain.
Since your spider is also called incyspider, you should rename the item class to something like incyspiderItem, then import it in your spider code and create an instance of it for each product:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import random
from incyspider.items import incyspiderItem
class incyspider(BaseSpider):
    name = "incyspider"
    def __init__(self):
        super(incyspider, self).__init__()
        domain_name = "incyspider.co.uk"
        # store the URLs on the instance so the spider can use them;
        # the with block closes the file afterwards
        with open("urls.csv") as f:
            self.start_urls = [url.strip() for url in f.readlines()]
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="Product"]')
        items = []
        for site in sites:
            # create a fresh item for each product
            item = incyspiderItem()
            item['title'] = hxs.select('//div[@class="Name"]/node()').extract()
            item['hlink'] = hxs.select('//div[@class="Price"]/node()').extract()
            item['price'] = hxs.select('//div[@class="Codes"]/node()').extract()
            items.append(item)
        return items
If I'm wrong, then please edit the question to explain how you know there is a problem with the code eg: is the expected output different to the actual output? If so, how?
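A related pitfall when start_urls is built inside __init__: a bare assignment such as start_urls = [...] only creates a local variable, whereas self.start_urls = [...] stores the list on the spider. A minimal sketch, with plain classes standing in for spiders (the class names and the example.com URL are made up):

```python
class LocalOnly:
    def __init__(self):
        # Plain assignment: creates a local variable that vanishes
        # when __init__ returns.
        start_urls = ["http://example.com/"]


class OnInstance:
    def __init__(self):
        # Attribute assignment: the list stays on the object.
        self.start_urls = ["http://example.com/"]


local_has_urls = hasattr(LocalOnly(), "start_urls")
instance_urls = OnInstance().start_urls
```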
