I've built my first Scrapy project but can't figure out the last hurdle.
With my script below I get one long list in the CSV: first all the product prices and then all the product names.
What I would like to achieve is that for every product the price is next to it.
For example:
Product Name, Product Price
Product Name, Product Price
My Scrapy project:
items.py:
from scrapy.item import Item, Field

class PrijsvergelijkingItem(Item):
    Product_ref = Field()
    Product_price = Field()
My spider, nvdb.py:
from scrapy.spider import BaseSpider
import scrapy.selector
from Prijsvergelijking.items import PrijsvergelijkingItem

class MySpider(BaseSpider):
    name = "nvdb"
    allowed_domains = ["vandenborre.be"]
    start_urls = ["http://www.vandenborre.be/tv-lcd-led/lcd-led-tv-80-cm-alle-producten"]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        titles = hxs.xpath("//ul[@id='prodlist_ul']")
        items = []
        for titles in titles:
            item = PrijsvergelijkingItem()
            item["Product_ref"] = titles.xpath("//div[@class='prod_naam']//text()[2]").extract()
            item["Product_price"] = titles.xpath("//div[@class='prijs']//text()[2]").extract()
            items.append(item)
        return items
You need to switch your XPath expressions to work in the context of each "product". To do this, prepend a dot to the expressions:
def parse(self, response):
    products = response.xpath("//ul[@id='prodlist_ul']/li")
    for product in products:
        item = PrijsvergelijkingItem()
        item["Product_ref"] = product.xpath(".//div[@class='prod_naam']//text()[2]").extract_first()
        item["Product_price"] = product.xpath(".//div[@class='prijs']//text()[2]").extract_first()
        yield item
I've also improved the code a little bit:
- I assume you meant to iterate over the list items (ul -> li), not just the ul - fixed the expression
- used the response.xpath() shortcut method
- used extract_first() instead of extract() (see the short illustration after this list)
- improved the variable naming
- used yield instead of collecting items in a list and then returning them
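For reference, a minimal illustration of the extract() vs extract_first() difference (sel here is a hypothetical selector over a single product node):

# extract() always returns a list of unicode strings
name_list = sel.xpath(".//div[@class='prod_naam']//text()").extract()    # e.g. [u'Some TV model']
# extract_first() returns the first match as a plain string, or None if nothing matched
name = sel.xpath(".//div[@class='prod_naam']//text()").extract_first()   # e.g. u'Some TV model'

This is why extract_first() is usually what you want for single-valued fields: the CSV exporter then writes plain strings instead of one-element lists.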
I am not sure if this can help you, but you can use OrderedDict from collections for this.
from scrapy.spider import BaseSpider
import scrapy.selector
from collections import OrderedDict
from Prijsvergelijking.items import PrijsvergelijkingItem

class MySpider(BaseSpider):
    name = "nvdb"
    allowed_domains = ["vandenborre.be"]
    start_urls = ["http://www.vandenborre.be/tv-lcd-led/lcd-led-tv-80-cm-alle-producten"]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        titles = hxs.xpath("//ul[@id='prodlist_ul']")
        items = []
        for titles in titles:
            item = OrderedDict(PrijsvergelijkingItem())
            item["Product_ref"] = titles.xpath("//div[@class='prod_naam']//text()[2]").extract()
            item["Product_price"] = titles.xpath("//div[@class='prijs']//text()[2]").extract()
            items.append(item)
        return items
Also, you might have to change the way you iterate over the resulting dicts:
for od in items:
    for key, value in od.items():
        print key, value
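As an aside: if the only goal is a fixed column order in the CSV output, newer Scrapy versions let you pin it with the FEED_EXPORT_FIELDS setting instead of touching the spider. A minimal sketch, assuming the default CSV feed exporter:

# settings.py - fixes the column order of the exported feed
FEED_EXPORT_FIELDS = ["Product_ref", "Product_price"]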
I'm trying to scrape this site using Scrapy, but it returns all the values in a single cell; I expect each value in a different row.
For example:
Mileage: 25
Mileage: 377
Mileage: 247433
Mileage: 464130
But I'm getting the data like this:
Mileage: [u'25',
u'377',
u'247433',
u'399109',
u'464130',
u'399631',
u'435238',
u'285000',
u'287470',
u'280000']
Here is my code:
import scrapy
from ..items import ExampleItem
from scrapy.selector import HtmlXPathSelector

url = 'https://example.com'

class Example(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = [url]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item_selector = hxs.select('//div[@class="listing_format card5 relative"]')
        for fields in item_selector:
            item = ExampleItem()
            item['Mileage'] = fields.select('//li[strong="Mileage"]/span/text()').extract()
            yield item
You didn't show your site, but maybe you need a relative XPath:
item['Mileage'] = fields.select('.//li[strong="Mileage"]/span/text()').extract_first()
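In context, the parse loop would look roughly like this (a sketch reusing the question's names; the leading dot is the only substantive change, restricting the search to the current listing):

for fields in item_selector:
    item = ExampleItem()
    # './/' searches inside this listing only, not the whole page
    item['Mileage'] = fields.select('.//li[strong="Mileage"]/span/text()').extract_first()
    yield item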
It sounds like you need to iterate over your mileages.
for fields in item_selector:
    mileages = fields.select('//li[strong="Mileage"]/span/text()').extract()
    for mileage in mileages:
        item = ExampleItem()
        item['Mileage'] = mileage
        yield item
Also consider making your fields.select('//li[strong="Mileage"]/span/text()').extract() call more specific, as sketched below.
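For instance, combining the relative XPath from the other answer with the per-mileage loop might look like this (a sketch under the same assumptions, reusing the question's ExampleItem):

for fields in item_selector:
    # relative XPath: only this listing's mileage values, not every value on the page
    mileages = fields.select('.//li[strong="Mileage"]/span/text()').extract()
    for mileage in mileages:
        item = ExampleItem()
        item['Mileage'] = mileage
        yield item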
Very new to scrapy, so bear with me.
First, here is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from usdirectory.items import UsdirectoryItem
from scrapy.http import Request

class MySpider(BaseSpider):
    name = "usdirectory"
    allowed_domains = ["domain.com"]
    start_urls = ["url_removed_sorry"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//*[@id="holder_result2"]/a[1]/span/span[1]/text()').extract()
        for title in titles:
            item = UsdirectoryItem()
            item["title"] = title
            item
            yield item
That works... but it only grabs the first item.
I noticed that in the items I am trying to scrape, the XPath changes for each row. For example, the first row is the XPath you see above:
//*[@id="holder_result2"]/a[1]/span/span[1]/text()
Then it increments by 2, all the way to 29. So the second result:
//*[@id="holder_result2"]/a[3]/span/span[1]/text()
Last result:
//*[@id="holder_result2"]/a[29]/span/span[1]/text()
So my question is: how do I get the script to grab all of those? I don't care if I have to copy and paste code for every item. All the other pages are exactly the same. I'm just not sure how to go about it.
Thank you very much.
Edit:
import scrapy
from scrapy.item import Item, Field

class UsdirectoryItem(scrapy.Item):
    title = scrapy.Field()
Given that the pattern is exactly as you described, you can use the XPath mod operator on the position index of a to get all the target a elements:
//*[@id="holder_result2"]/a[position() mod 2 = 1]/span/span[1]/text()
For a quick demo, consider the following input XML:
<div>
    <a>1</a>
    <a>2</a>
    <a>3</a>
    <a>4</a>
    <a>5</a>
</div>
Given this XPath, /div/a[position() mod 2 = 1], the following elements will be returned:
<a>1</a>
<a>3</a>
<a>5</a>
See a live demo at xpathtester.com.
Let me know if this works for you. Notice we are iterating over a[1 + i*2] instead of just a[1], and yielding an item for each extracted title.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for i in xrange(15):
        # selects a[1], a[3], ..., a[29]
        titles = hxs.select('//*[@id="holder_result2"]/a[' + str(1 + i * 2) + ']/span/span[1]/text()').extract()
        for title in titles:
            item = UsdirectoryItem()
            item["title"] = title
            yield item
I'm new to Scrapy and web crawling, and I've been working on the page www.mercadolibre.com.mx. I have to get (from the start page) some data (description and prices) about the products displayed there. Here is my items.py:
from scrapy.item import Item, Field

class PruebaMercadolibreItem(Item):
    producto = Field()
    precio = Field()
And here is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from prueba_mercadolibre.items import PruebaMercadolibreItem

class MLSpider(BaseSpider):
    name = "mlspider"
    allowed_domains = ["mercadolibre.com"]
    start_urls = ["http://www.mercadolibre.com.mx"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//div[@class='item-data']")
        items = []
        for titles in titles:
            item = PruebaMercadolibreItem()
            item["producto"] = titles.select("p[@class='tit le']/@title").extract()
            item["precio"] = titles.select("span[@class='ch-price']/text()").extract()
            items.append(item)
        return items
The problem is that I get the same results when I change this line:
titles = hxs.select("//div[@class='item-data']")
to this:
titles = hxs.select("//div[@class='item-data'] | //div[@class='item-data item-data-mp']")
The second expression doesn't seem to pick up any additional data compared to the first line.
Can anyone help me? Do I have an error in my XPath selection?
Also, I can't find a good tutorial for using MySQL with Scrapy; I would appreciate any help. Thanks.
Better to use contains() if you want to get all div tags containing the item-data class:
titles = hxs.select("//div[contains(@class, 'item-data')]")
Also, you have other problems in the spider:
- in the loop, you are overriding titles
- the class name in the producto XPath should be title, not tit le
- you probably don't want lists as Field values; take the first item out of each extracted list
- HtmlXPathSelector is deprecated, use Selector instead
- select() is deprecated, use xpath() instead
- BaseSpider has been renamed to Spider
Here's the code with modifications:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.item import Item, Field
from prueba_mercadolibre.items import PruebaMercadolibreItem

class MLSpider(Spider):
    name = "mlspider"
    allowed_domains = ["mercadolibre.com"]
    start_urls = ["http://www.mercadolibre.com.mx"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[contains(@class, 'item-data')]")
        for title in titles:
            item = PruebaMercadolibreItem()
            item["producto"] = title.xpath("p[@class='title']/@title").extract()[0]
            item["precio"] = title.xpath("span[@class='ch-price']/text()").extract()[0]
            yield item
Example items from the output:
{'precio': u'$ 35,000', 'producto': u'Cuatrimoto, Utv De 500cc 4x4 ,moto , Motos, Atv ,'}
{'precio': u'$ 695', 'producto': u'Reloj Esp\xeda Camara Oculta Video Hd 16 Gb! Sony Compara.'}
Total newbie here, trying to get Scrapy to read a list of URLs from a CSV and return the items in a CSV.
I need some help figuring out where I'm going wrong:
Spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import random

class incyspider(BaseSpider):
    name = "incyspider"

    def __init__(self):
        super(incyspider, self).__init__()
        domain_name = "incyspider.co.uk"
        f = open("urls.csv")
        start_urls = [url.strip() for url in f.readlines()]
        f.close

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="Product"]')
        items = []
        for site in sites:
            item['title'] = hxs.select('//div[@class="Name"]/node()').extract()
            item['hlink'] = hxs.select('//div[@class="Price"]/node()').extract()
            item['price'] = hxs.select('//div[@class="Codes"]/node()').extract()
            items.append(item)
        return items

SPIDER = incyspider()
Here's the items.py code:
from scrapy.item import Item, Field

class incyspider(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    hlink = Field()
    price = Field()
To run, I'm using
scrapy crawl incyspider -o items.csv -t csv
I would seriously appreciate any pointers.
I'm not exactly sure, but after a quick look at your code I would say that at least you need to replace this line:
sites = hxs.select('//div[@class="Product"]')
with this line:
sites = hxs.select('//div[@class="Product"]').extract()
As a first punt at answering this, your spider code is missing an import for your incyspider item class. Also, you're not creating an instance of any kind of item to store the title/hlink/price info, so the item['title'] and items.append(item) lines will complain.
Since your spider is also called incyspider, you should rename the item to something like incyspiderItem and then add the import shown in the listing below to your spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import random
from incyspider.items import incyspiderItem

class incyspider(BaseSpider):
    name = "incyspider"

    def __init__(self):
        super(incyspider, self).__init__()
        domain_name = "incyspider.co.uk"
        f = open("urls.csv")
        start_urls = [url.strip() for url in f.readlines()]
        f.close()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="Product"]')
        items = []
        for site in sites:
            item = incyspiderItem()
            # select relative to the current product, not the whole page
            item['title'] = site.select('.//div[@class="Name"]/node()').extract()
            item['hlink'] = site.select('.//div[@class="Price"]/node()').extract()
            item['price'] = site.select('.//div[@class="Codes"]/node()').extract()
            items.append(item)
        return items
If I'm wrong, then please edit the question to explain how you know there is a problem with the code, e.g. is the expected output different from the actual output? If so, how?
I am a newbie.
This is my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from ampa.items import AmpaItem

class AmpaSpider(CrawlSpider):
    name = "ampa"
    allowed_domains = ['website']
    start_urls = ['website/page']
    rules = (Rule(SgmlLinkExtractor(allow=('associados?', ), deny=('associado/', )), callback='parse_page', follow=True),)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = AmpaItem()
        farmers = hxs.select('//div[@class="span-24 tx_left"]')
        item['nome'] = farmers.select('//div/h3[@class="titulo"]/a/text()').extract()
        item['phone'] = farmers.select('//div/span[@class="chamada"]/a[contains(text(), "Telefone")]/text()').extract()
        item['email'] = farmers.select('//div/span[@class="chamada"]/a[contains(text(), "E-mail")]/text()').extract()
        print item.values()
        return item
This is my pipeline:
import csv

class CsvWriterPipeline(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('items.csv', 'wb'))

    def process_item(self, item, ampa):
        self.csvwriter.writerow([item['nome'], item['phone'], item['email']])
        return item
Each page of the website has a list of names, phones and e-mails. The code above outputs a CSV file with three columns and one row per page: in the first column each cell is a list of all the names on that page, in the second a list of all the phones, and in the third a list of all the e-mails.
What I really want is each name, phone and e-mail in its own row. I tried to do it by looping through each item, but it only prints the first name, phone and e-mail of each page. (Is that because the callback moves the crawler to the next URL each time the spider function returns an item? Does it?)
How would you go about that?
Here is the item:
from scrapy.item import Item, Field

class AmpaItem(Item):
    nome = Field()
    phone = Field()
    email = Field()
Based on your use of the plural in farmers, I assume there are many farmers on the page, so your expression will likely return a collection of farmers.
Can you loop through the result of farmers and yield each item?
# pseudocode
hxs = HtmlXPathSelector(response)
farmers = hxs.select('//div[@class="span-24 tx_left"]')
for farmer in farmers:
    item = AmpaItem()
    # be sure to select only the one desired farmer here:
    # the leading dot keeps each XPath relative to the current farmer
    item['nome'] = farmer.select('.//div/h3[@class="titulo"]/a/text()').extract()
    item['phone'] = farmer.select('.//div/span[@class="chamada"]/a[contains(text(), "Telefone")]/text()').extract()
    item['email'] = farmer.select('.//div/span[@class="chamada"]/a[contains(text(), "E-mail")]/text()').extract()
    yield item
I found the solution by changing my pipeline:
import csv
import itertools

class CsvWriterPipeline(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('items.csv', 'wb'), delimiter=',')

    def process_item(self, item, ampa):
        for i, n, k in itertools.izip(item['nome'], item['phone'], item['email']):
            self.csvwriter.writerow([i, n, k])
        return item
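One detail worth noting for anyone reusing this: the pipeline only runs if it is enabled in the project settings. A minimal sketch, assuming the project module is named ampa as in this question (Scrapy versions of this era take a list; newer ones expect a dict mapping the class path to an order number):

# settings.py - register the pipeline so Scrapy actually calls process_item()
ITEM_PIPELINES = ['ampa.pipelines.CsvWriterPipeline']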
Thanks DrColossos and dm03514!
This was my first question on Stack Overflow!