Getting data from a table with Scrapy, different row order per page - Python

So I have Scrapy working really well. It's grabbing data out of a page, but the problem I'm running into is that sometimes the table order is different from page to page. For example, the first page it gets to:
Row name    Data
Name 1      data 1
Name 2      data 2
The next page it crawls might have the order completely different: where Name 1 was the first row on one page, on another it might be the 3rd or 4th, etc. The row names are always the same. I was thinking of doing this one of two ways, and I'm not sure which will work or which is better.
First option, use some if statements to find the row I need, and then grab the following column. This seems a little messy but could work.
Second option, grab all the data in the table regardless of order and put it in a dict. This way, I can grab the data I need based on row name. This seems like the cleanest approach.
Is there a 3rd option or a better way of doing either?
Here's my code in case it's helpful.
class pageSpider(Spider):
    name = "pageSpider"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://domain.com/stuffs/results",
    ]
    visitedURLs = Set()

    def parse(self, response):
        products = Selector(response).xpath('//*[@class="itemCell"]')
        for product in products:
            item = PageScraper()
            item['url'] = product.xpath('div[2]/div/a/@href').extract()[0]
            urls = Set([product.xpath('div[2]/div/a/@href').extract()[0]])
            print urls
            for url in urls:
                if url not in self.visitedURLs:
                    request = Request(url, callback=self.productpage)
                    request.meta['item'] = item
                    yield request

    def productpage(self, response):
        specs = Selector(response).xpath('//*[@id="Specs"]')
        item = response.meta['item']
        for spec in specs:
            item['make'] = spec.xpath('fieldset[1]/dl[1]/dd/text()').extract()[0].encode('utf-8', 'ignore')
            item['model'] = spec.xpath('fieldset[1]/dl[4]/dd/text()').extract()[0].encode('utf-8', 'ignore')
            item['price'] = spec.xpath('fieldset[2]/dl/dd/text()').extract()[0].encode('utf-8', 'ignore')
            yield item
The XPaths in productpage can end up grabbing data that doesn't correspond to what I need, because the order changed.
Edit:
I'm trying the dict approach and I think this is the best option.
def productpage(self, response):
    specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
    itemdict = {}
    for i in specs:
        test = i.xpath('dl')
        for t in test:
            itemdict[t.xpath('dt/text()').extract()[0]] = t.xpath('dd/text()').extract()[0]
    item = response.meta['item']
    item['make'] = itemdict['Brand']
    yield item

This seems like the best and cleanest approach (using a dict):
def productpage(self, response):
    specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
    itemdict = {}
    for i in specs:
        test = i.xpath('dl')
        for t in test:
            itemdict[t.xpath('dt/text()').extract()[0]] = t.xpath('dd/text()').extract()[0]
    item = response.meta['item']
    item['make'] = itemdict['Brand']
    yield item
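One small addition worth considering, as a sketch on top of the approach above: using dict.get() instead of direct indexing means a page that happens to be missing a row won't raise a KeyError. The 'Brand' key and the make field come from the code above; the empty-string default is only illustrative:

def productpage(self, response):
    specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
    itemdict = {}
    for fieldset in specs:
        for dl in fieldset.xpath('dl'):
            # dt holds the row name, dd holds the value
            itemdict[dl.xpath('dt/text()').extract()[0]] = dl.xpath('dd/text()').extract()[0]
    item = response.meta['item']
    # .get() falls back to a default instead of raising KeyError when a row is absent
    item['make'] = itemdict.get('Brand', '')
    yield item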

Related

List elements retrieved by XPath in Scrapy do not output correctly item by item (for, yield)

I am outputting the URL of the first page of each seller's order results, extracted from a specific e-commerce site, to a CSV file, reading it in start_requests, and looping through it with a for statement.
Each order result page contains information on 30 products.
https://www.buyma.com/buyer/2597809/sales_1.html
I specified the links for the 30 items on each order results page, got them back as lists, and tried to retrieve them one by one and store them in the item as shown in the code below, but it does not work.
class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy2'
    allowed_domains = ['www.buyma.com']

    def start_requests(self):
        with open('/Users/morni/researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                for n in range(1, 300):
                    url = str(row[2][:-5] + '/sales_' + str(n) + '.html')
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        dont_filter=True
                    )

    def parse_firstpage_item(self, response):
        loader = ItemLoader(item=ResearchtoolItem(), response=response)
        Conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_name = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()').getall()
        product_URL = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()
        for i in range(30):
            loader.add_value("Conversion_date", Conversion_date[i])
            loader.add_value("product_name", product_name[i])
            loader.add_value("product_URL", product_URL[i])
            yield loader.load_item()
The output is as follows, where each item contains multiple items of information at once.
Current status:
{"product_name": ["product1", "product2"]), "Conversion_date":["Conversion_date1", "Conversion_date2" ], "product_URL":["product_URL1", "product_URL2"]},
Ideal:
[{"product_name": "product1", "Conversion_date": Conversion_date1", "product_URL": "product_URL1"},{"product_name": "product2", "Conversion_date": Conversion_date2", "product_URL": "product_URL2"}]
This may be due to my lack of understanding of basic for statements and yield.
You need to create a new loader on each iteration:
for i in range(30):
    loader = ItemLoader(item=ResearchtoolItem(), response=response)
    loader.add_value("Conversion_date", Conversion_date[i])
    loader.add_value("product_name", product_name[i])
    loader.add_value("product_URL", product_URL[i])
    yield loader.load_item()
EDIT:
add_value appends the value to a list. Since the list starts with zero elements, after you append you'll have a list with one element.
In order to get the values as a string you can use a processor. Example:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class ProductItem(scrapy.Item):
    name = scrapy.Field(output_processor=TakeFirst())
    price = scrapy.Field(output_processor=TakeFirst())


class ExampleSpider(scrapy.Spider):
    name = 'exampleSpider'
    start_urls = ['https://scrapingclub.com/exercise/list_infinite_scroll/']

    def parse(self, response, **kwargs):
        names = response.xpath('//div[@class="card-body"]//h4/a/text()').getall()
        prices = response.xpath('//div[@class="card-body"]//h5//text()').getall()
        length = len(names)
        for i in range(length):
            loader = ItemLoader(item=ProductItem(), response=response)
            loader.add_value('name', names[i])
            loader.add_value('price', prices[i])
            yield loader.load_item()

Scrapy request does not callback

I am trying to create a spider that takes data from a csv (two links and a name per row), and scrapes a simple element (price) from each of those links, returning an item for each row, with the item's name being the name in the csv, and two scraped prices (one from each link).
Everything works as expected except that, instead of the prices that would be returned from each request's callback function, I get a request object like this:
< GET https://link.com>..
The callback functions don't get called at all, why is that?
Here is the spider:
f = open('data.csv')
f_reader = csv.reader(f)
f_data = list(f_reader)

parsed_data = []
for product in f_data:
    product = product[0].split(';')
    parsed_data.append(product)
f.close()


class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['domain1', 'domain2']
    start_urls = ["domain1_but_its_fairly_useless"]

    def parse(self, response):
        global parsed_data
        for product in parsed_data:
            item = Product()
            item['name'] = product[0]
            item['first_price'] = scrapy.Request(product[1], callback=self.parse_first)
            item['second_price'] = scrapy.Request(product[2], callback=self.parse_second)
            yield item

    def parse_first(self, response):
        digits = response.css('.price_info .price span').extract()
        decimals = response.css('.price_info .price .price_demicals').extract()
        yield float(str(digits)+'.'+str(decimals))

    def parse_second(self, response):
        digits = response.css('.lr-prod-pricebox-price .lr-prod-pricebox-price-primary span[itemprop="price"]').extract()
        yield digits
Thanks in advance for your help!
TL;DR: You are yielding an item with Request objects inside of it when you should yield either Item or Request.
Long version:
Parse methods in your spider should return either a scrapy.Item, in which case the chain for that crawl stops and Scrapy puts out an item, or a scrapy.Request, in which case Scrapy schedules a request to continue the chain.
Scrapy is asynchronous, so creating an item from multiple requests means you need to chain all of those requests while carrying your item through every one of them, filling it up little by little.
The Request object has a meta attribute where you can store (pretty much) anything you want, and it will be carried to your callback function. It's very commonly used to chain requests for items that need multiple requests to form a single item.
Your spider should look something like this:
class ProductSpider(scrapy.Spider):
    # <...>
    def parse(self, response):
        for product in parsed_data:
            item = Product()
            item['name'] = product[0]
            # carry the next url you want to crawl in meta
            # and carry your item in meta
            yield Request(product[1], self.parse_first,
                          meta={"product3": product[2], "item": item})

    def parse_first(self, response):
        # retrieve the item that you made in parse()
        item = response.meta['item']
        # fill it up
        digits = response.css('.price_info .price span').extract()
        decimals = response.css('.price_info .price .price_demicals').extract()
        item['first_price'] = float(str(digits)+'.'+str(decimals))
        # retrieve the next url from meta
        # and carry over your item to the next url
        yield Request(response.meta['product3'], self.parse_second,
                      meta={"item": item})

    def parse_second(self, response):
        # again, retrieve your item
        item = response.meta['item']
        # fill it up
        digits = response.css('.lr-prod-pricebox-price .lr-prod-pricebox-price-primary span[itemprop="price"]').extract()
        item['second_price'] = digits
        # and finally return the item after 3 requests!
        yield item

How to populate scrapy item list with hardcoded string

My Scrapy crawler is working fine; currently it is crawling some tables, but on some websites not all of the information I'd like to insert into my MySQL table is available.
So I thought about adding it myself, because on those websites the information for those fields is always the same, but I'm not sure how to populate the fields in the spider.
Sure, I could determine the length of one of the lists in the pipeline and then use a while loop to add, for example, USA to the item['country'] list, but I want to do the same in the spider.
I would appreciate some help, thank you.
Current spider code for populating lists:
def parse(self, response):
    for sel in response.xpath('//div[@class="pagecontainer"]'):
        item = EbayItem()
        item['id'] = sel.xpath('div[2]/text()[2]').extract()
        item['user'] = sel.xpath('tr/td[2]/text()[1]').extract()
        item['string'] = sel.xpath('tr/td[2]/a/text()').extract()
        item['state'] = sel.xpath('tr/td[3]/b[3]/text()').extract()
        item['country'] = sel.xpath('tr/td[3]/b[1]/text()').extract()
        item['weight'] = sel.xpath('tr/td[3]/b[2]/text()').extract()
        item['position'] = sel.xpath('tr/td[4]/text()').re(r'[0-9,\-]+')
        item['old'] = sel.xpath('tr/td[5]/text()').extract()
        item['datetime'] = sel.xpath('tr/td[6]/text()').re('[0-9]{2}.[0-9]{2}.[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')
        yield item
Greetings
P.Halmsich
You want to add things to MySQL. This means your fields shouldn't be lists (e.g. ['my-value']) but scalars (e.g. 'my-value'). The easiest way to do this is to use extract_first() instead of extract().
extract_first() allows you to set default values like this: .extract_first(default='my-default-value') or just .extract_first('my-default-value')
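For example, a minimal sketch of how the country field from the question could be filled that way (field names and XPaths copied from the question; the 'USA' default is only illustrative):

def parse(self, response):
    for sel in response.xpath('//div[@class="pagecontainer"]'):
        item = EbayItem()
        # extract_first() returns a single string and falls back to the
        # default when the node is missing on a page
        item['country'] = sel.xpath('tr/td[3]/b[1]/text()').extract_first(default='USA')
        item['weight'] = sel.xpath('tr/td[3]/b[2]/text()').extract_first('')
        yield item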
Cheers
You can always check the scraped item for empty results using an if/else statement. Try the code below:
def parse(self, response):
    for sel in response.xpath('//div[@class="pagecontainer"]'):
        item = EbayItem()
        item['id'] = sel.xpath('div[2]/text()[2]').extract()
        item['user'] = sel.xpath('tr/td[2]/text()[1]').extract()
        item['string'] = sel.xpath('tr/td[2]/a/text()').extract()
        item['state'] = sel.xpath('tr/td[3]/b[3]/text()').extract()
        item['country'] = sel.xpath('tr/td[3]/b[1]/text()').extract()
        if item['country'] == []:
            item['country'] = 'USA'
        item['weight'] = sel.xpath('tr/td[3]/b[2]/text()').extract()
        item['position'] = sel.xpath('tr/td[4]/text()').re(r'[0-9,\-]+')
        item['old'] = sel.xpath('tr/td[5]/text()').extract()
        item['datetime'] = sel.xpath('tr/td[6]/text()').re('[0-9]{2}.[0-9]{2}.[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')
        yield item

Conditionally Scraping Domains based on keywords stored in a csv using Scrapy

I have a list of keywords stored in a CSV which I need to search for on different domains and scrape information for.
I've tried to instruct my spider to search the domains in this order: example1.com, example2.com, example3.com and example4.com.
The spider only enters the next domain if it does not find a match in the previous domain. If a match for the keyword has been found in any of those domains, the next keyword from my CSV is picked and the search restarts from example1.com.
Importantly, I also require the particular keyword picked from the CSV to be stored in one of the item's fields.
So far my code is:
item = ExampleItem()

f = open("InputKeywords.csv")
csv_file = csv.reader(f)
productname_list = []
for row in csv_file:
    productname_list.append(row[1])


class MySpider(CrawlSpider):
    name = "test1"
    allowed_domains = ["example1.com", "example2.com", "example3.com"]

    def start_requests(self):
        for keyword in productname_list:
            item["Product_Name"] = keyword  # loading the searched keyword into my output
            request = Request("http://www.example1.com/search?noOfResults=20&keyword=" + str(keyword), self.Example1)
            yield request
            if item["Example1"] == "No Image Found on Example1":
                request = Request("www.Example2.in/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=" + str(keyword), self.Example2)
                yield request
            if item["Example2"] == "No Image Found on Example3":
                request = Request("http://www.Example3.com/search?noOfResults=20&keyword=" + str(keyword), self.Example3)
                yield request

    def Example1(self, response):
        sel = Selector(response)
        result = response.xpath("//div[@class='hoverProductImage product-image '][1]/a/@href")  # checking if the search term exists on the domain
        if result:
            request = Request(result.extract()[0], callback=self.Example1Items)  # for parsing information if the search keyword is found
            request.meta["item"] = item
            return request
        else:
            item["Example1"] = "No Image Found at Example1"
            return item

    def Example1Items(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item["Example1"] = sel.xpath("//meta[@name='og_image']/@content").extract()
        return item

    def Example2(self, response):
        sel = Selector(response)
        result = response.xpath("//div[@class='a-row a-spacing-small'][1]/a/@href")
        if result:
            request = Request(result.extract()[0], callback=self.Example2Items)
            request.meta["item"] = item
            return request
        else:
            item["Example2"] = "No Image Found at Example2"
            return item

    def Example2Items(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item["Example2"] = sel.xpath("//div[@class='ap_content']/div/div/div/img/@src").extract()
        return item
----CODE FOR EXAMPLE 3 and EXAMPLE 3 Items----
My code is far from correct, but the first problem I'm facing is that my keywords are not being stored in the same order as in the input CSV.
I am also not able to execute the logic for the example2 or example3 searches based on the not-found condition.
Any help would be appreciated.
Basically I need my output, which I'll be storing in a CSV, to look like this:
{
"Keyword1", "Example1Found","","",
"Keyword2", "No Image Found at Example1","No Image Found at Example2","Example3Found",
"Keyword3", "No Image Found at Example1","Example2Found","",
}

Modify item in multiple parse function and return updated item?

I have an item which will be filled in each of the parse functions. I want to return the updated item after parsing is complete. Here is my scenario:
My Item class:
class MyItem(Item):
    name = Field()
    links1 = Field()
    links2 = Field()
I have multiple urls to crawl after login:
In the parse function, I'm doing:
for url in urls:
    yield Request(url=url, callback=self.get_info)
In get_info, I will be extracting 'name' and 'links' in each response:
item = MyItem()
item['name'] = hxs.select("//title/text()").extract()
links = []
link = {}
for data in json_parsed_from_response:
    link['name'] = data.get('name')
    link['url'] = data.get('url')
    links.append(link)
item['links1'] = links
# similarly, item['links2'] is created.
Now, I want to go through each URL in item['links1'] and item['links2'] as follows (these loops are inside get_info):
for link in item['links1']:
    request = Request(url=link['url'], callback=self.get_status)
    request.meta['link'] = link
    yield request

for link in item['links2']:
    request = Request(url=link['url'], callback=self.get_status)
    request.meta['link'] = link
    yield request

# Where do I return item? I can't return item inside a generator.

def get_status(self, response):
    link = response.meta['link']
    if "good" in response.body:
        link['status'] = 'good'
    else:
        link['status'] = 'bad'
    # Will changes made here be reflected in the item?
    # Also, I can't return item from here. Multiple items will be returned.
I can't figure out where the item has to be returned from so that it has all the updated data.
Sorry, but unless you give out some more details I can't understand the design of your code and therefore I can't help... The best suggestion I have is to create a list of *MyItem*s and append each item you create to that list. The values should change as you change them. So you should be able to iterate over the list and see the updated items.
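For what it's worth, here is a minimal sketch (not from the answer above) of one common way to yield a single fully-updated item: because the link dicts stored on the item are the same objects carried through meta, mutating them in get_status also updates the item, so the item can be yielded once from whichever get_status call handles the last outstanding link. The shared remaining counter is a hypothetical helper, not something from the question, and a failed request would also need an errback that decrements it:

def get_info(self, response):
    item = MyItem()
    item['name'] = hxs.select("//title/text()").extract()
    # ... fill item['links1'] and item['links2'] exactly as in the question ...
    links = item['links1'] + item['links2']
    remaining = {'count': len(links)}  # shared mutable counter (hypothetical)
    for link in links:
        request = Request(url=link['url'], callback=self.get_status)
        request.meta['item'] = item
        request.meta['link'] = link
        request.meta['remaining'] = remaining
        yield request

def get_status(self, response):
    link = response.meta['link']
    if "good" in response.body:
        link['status'] = 'good'
    else:
        link['status'] = 'bad'
    remaining = response.meta['remaining']
    remaining['count'] -= 1
    if remaining['count'] == 0:
        # every link now has a status, and the mutated dicts already live
        # inside item['links1'] / item['links2'], so yield the item once
        yield response.meta['item']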
