Scrapy Extract ld+JSON - python

How to extract the name and url?
quotes_spiders.py
import scrapy
import json


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://www.lazada.com.my/shop-power-banks2/?price=1572-1572"]

    def parse(self, response):
        data = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
        # how to extract the name and url?
        yield data
Data to Extract
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "ItemList",
  "itemListElement": [
    {
      "@type": "Product",
      "image": "http://my-live-02.slatic.net/p/2/test-product-0601-7378-08684315-8be741b9107b9ace2f2fe68d9c9fd61a-webp-catalog_233.jpg",
      "name": "test product 0601",
      "offers": {
        "@type": "Offer",
        "availability": "https://schema.org/InStock",
        "price": "99999.00",
        "priceCurrency": "RM"
      },
      "url": "http://www.lazada.com.my/test-product-0601-51348680.html?ff=1"
    }
  ]
}
</script>

This line of code returns a dictionary with the data you want:
data = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
All you need to do is to access it like:
name = data['itemListElement'][0]['name']
url = data['itemListElement'][0]['url']
Given that the microdata contains a list, you will need to check that you are referring to the correct product in the list.
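If the page lists several products, you can loop over itemListElement instead of hard-coding index 0. A minimal sketch inside the spider from the question, reusing its selector (the loop and yielded field names are illustrative):

def parse(self, response):
    raw = response.xpath('//script[@type="application/ld+json"]//text()').extract_first()
    data = json.loads(raw)
    # One item per product in the JSON-LD ItemList.
    for product in data.get('itemListElement', []):
        yield {
            'name': product.get('name'),
            'url': product.get('url'),
        }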

A really easy solution for this would be to use https://github.com/scrapinghub/extruct. It handles all the hard parts of extracting structured data.
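As a rough sketch, assuming extruct is installed (the syntaxes argument and the 'json-ld' key reflect extruct's documented API, but treat the exact call as an assumption):

import extruct

def parse(self, response):
    # Pull only the JSON-LD blocks out of the page.
    data = extruct.extract(response.text, syntaxes=['json-ld'])
    for block in data.get('json-ld', []):
        for product in block.get('itemListElement', []):
            yield {'name': product.get('name'), 'url': product.get('url')}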

Related

Scrapy table columns and rows don't work

I want to scrape the table on this page, but the scraped data ends up in only one column, and in some cases the data doesn't appear at all. I also used the shell to check that the XPath is correct (I used the XPath Helper extension to identify these XPaths).
import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'scrape-xpath'
    start_urls = [
        'http://explorer.eu/contents/food/28?utf8=/',
    ]

    def parse(self, response):
        for flv in response.xpath('//html/body/main/div[4]'):
            yield {
                'Titulo': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[3]/th/strong/a/text()').extract(),
                'contenido': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[3]/a[2]/text()').extract(),
                'clase': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[1]/text()').extract(),
                'Subclase': flv.xpath('//*[@id="chromatography"]/table/tbody/tr[5]/td[2]/a/text()').extract(),
            }
From the example URL given, it's not exactly obvious what the values should be or how extraction should generalize to a page containing more records. So I tried a different page with multiple records; let's see if the result gets you what you need. Here's ready-to-run code:
# -*- coding: utf-8 -*-
import scrapy


class PhenolExplorerSpider(scrapy.Spider):
    name = 'phenol-explorer'
    start_urls = ['http://phenol-explorer.eu/contents/food/29?utf8=/']

    def parse(self, response):
        chromatography = response.xpath('//div[@id="chromatography"]')
        title = chromatography.xpath('.//tr/th[@class="outer"]/strong/a/text()').extract_first()
        for row in chromatography.xpath('.//tr[not(@class="header")]'):
            class_ = row.xpath('./td[@rowspan]/text()').extract_first()
            if not class_:
                class_ = row.xpath('./preceding-sibling::tr[td[@rowspan]][1]/td[@rowspan]/text()').extract_first()
            subclass = row.xpath('./td[not(@rowspan)][1]/a/text()').extract_first()
            # content = row.xpath('./td[not(@rowspan)][2]/a[2]/text()').extract_first()
            content = row.xpath('./td[not(@rowspan)][2]/text()').extract_first()
            yield {
                'title': title.strip(),
                'class': class_.strip(),
                'subclass': subclass.strip(),
                'content': content.strip(),
            }
Basically, it iterates over individual rows of the table and extracts the data from corresponding fields, yielding an item once complete information is collected.
Try this:
for row in response.css('#chromatography table tr:not(.header)'):
    yield {
        'titulo': row.xpath('./preceding-sibling::tr/th[contains(@class, "outer")]//a/text()').extract_first().strip(),
        'clase': row.xpath('./preceding-sibling::tr/th[contains(@class, "inner")]//text()').extract_first().strip(),
        'subclase': row.xpath('./td[2]//text()').extract_first().strip(),
        'contenido': row.css('.content_value a::text').extract_first().strip(),
    }
Remember that the inner-loop selectors should be relative to the node (flv in your case); selecting with // is a global selector, so it will grab everything from the whole document.
It's also better to inspect the raw HTML, because the browser may render markup that differs from the HTML actually received (for example, the tbody tags).
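To illustrate the difference between relative and global selectors (flv is the node from the question; the td paths are just examples):

for flv in response.xpath('//html/body/main/div[4]'):
    # Relative to flv: only cells inside this div.
    local_cells = flv.xpath('.//td/text()').extract()
    # Global: every td in the whole document, regardless of flv.
    all_cells = flv.xpath('//td/text()').extract()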

Scrapy parse to pipeline

For example, I want to crawl three similar URLs:
https://example.com/book1
https://example.com/book2
https://example.com/book3
What I want is that, in pipeline.py, I create 3 files named book1, book2 and book3, and write each book's data to the correct file, separately.
In spider.py I know the three books' names, which I use as the file names, but I don't have them in pipeline.py.
The pages have the same structure, so I decided to code it like below:
class Book_Spider(scrapy.Spider):

    def start_requests(self):
        for url in urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # item handling
        yield item
Now, how can I do this?
Smith, if you want to know the book name in pipeline.py, there are two options: either you add an item field for book_file_name and populate it as you see fit, or you extract it from the url field, which is also an item field and can be accessed in pipeline.py.
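A minimal sketch of the first option, carrying a book_file_name item field through to the pipeline (the item, field, and pipeline names here are illustrative, not from the original answer):

# items.py
import scrapy

class BookItem(scrapy.Item):
    book_file_name = scrapy.Field()
    data = scrapy.Field()

# spider.py, inside parse(): derive the name from the URL, e.g. "book1"
#   item['book_file_name'] = response.url.rstrip('/').split('/')[-1]

# pipelines.py
class WriteBookPipeline(object):
    def process_item(self, item, spider):
        # Append this book's data to its own file.
        with open(item['book_file_name'], 'a') as f:
            f.write(str(item['data']) + '\n')
        return item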

Using a loop to enter values into "start_urls" from a CSV

I basically have a list of titles, stored in a CSV, to search for on a website.
I'm extracting those values and then trying to append them to the search link in start_urls.
However, when I run the script, it only takes the last value of the list.
Is there any particular reason why this happens?
class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ["example.com"]

    import pandas as pd
    df = pd.read_csv('test.csv')
    saved_column = df.ProductName
    for a in saved_column:
        start_urls = ["http://www.example.com/search?noOfResults=20&keyword=" + str(a)]

    def parse(self, response):
        ...
There is a conceptual error in your code: the loop does nothing but reassign start_urls on each iteration, so the spider ends up seeing only the last value from the list.
Another approach would be to override the spider's start_requests method:
import pandas as pd
from scrapy import Request

def start_requests(self):
    df = pd.read_csv('test.csv')
    saved_column = df.ProductName
    for product_name in saved_column:
        # Build the full search URL from each product name in the CSV.
        url = "http://www.example.com/search?noOfResults=20&keyword=" + str(product_name)
        yield Request(url, self.parse)
The idea comes from here: How to generate the start_urls dynamically in crawling?

Web crawling: How to obtain URLs which use database information?

Here is my problem statement:
I'm trying to retrieve all the well-specific information for a state from http://www.aogc2.state.ar.us/AOGConline/ . After doing a bit of R&D, I figured out that individual well information is stored at paths structured as:
http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100280000&KeyType=STRING&DetailXML=WellDetails.xml
where each KeyValue is unique for every well. I was trying to derive a generic pattern in the KeyValue. For the URL above, in 03143100280000, 03 represents the state (Arkansas) and 143 represents the county, but the remaining number, 100280000, does not necessarily follow a serial pattern, which makes life difficult.
Is there a way to obtain all the KeyValues for the 43K+ wells here (which I'm presuming come from a database)? I tried looking at all the source JS files being loaded from http://www.aogc2.state.ar.us/AOGConline/ but none of them points towards a directory of all KeyValues/well APIs.
Using Python Scrapy, I've written the following spider, which crawls a few specific well XML URLs. I need to make this generic so as to obtain the information for all 43K+ wells, but I haven't been able to find a way to figure out all the KeyValues.
from scrapy.spider import Spider
from scrapy.selector import Selector
import codecs


class AogcSpider(Spider):
    name = "aogc"
    allowed_domains = ["aogc2.state.ar.us"]
    start_urls = [
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100280000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100290000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100300000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100310000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100320000&KeyType=STRING&DetailXML=WellDetails.xml",
        "http://www.aogc2.state.ar.us/AOGConline/ED.aspx?KeyName=API_WELLNO&KeyValue=03143100330000&KeyType=STRING&DetailXML=WellDetails.xml"
    ]

    def parse(self, response):
        hxs = Selector(response)
        trnodes = hxs.xpath("//td[@class='ColumnValue']")
        # Append the field values of each response as a pipe-separated line.
        filename = codecs.open("aogc_wells", "a", "utf-8-sig")
        filename.write("\n")
        for nodes in trnodes:
            ftext = nodes.xpath("text()").extract()
            for txt in ftext:
                filename.write(txt)
                filename.write("|")
        filename.close()

Using Scrapy for XML page

I'm trying to scrape multiple pages from an API to practice and develop my XML scraping. One issue that has arisen is that when I try to scrape a document formatted like this: http://i.imgur.com/zJqeYvG.png and store it as XML, it fails to do so.
Within the CMD it fetches the URL and creates the XML file on my computer, but there's nothing in it.
How would I fix it to echo out the whole document or even parts of it?
I put the code below:
from scrapy.spider import BaseSpider
from scrapy.selector import XmlXPathSelector
from doitapi.items import DoIt
import random


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["do-it.org.uk"]
    start_urls = []
    number = []
    for count in range(100):
        number.append(random.randint(2000000, 2500000))
    for i in number:
        start_urls.append("http://www.do-it.org.uk/syndication/opportunities/%d?apiKey=XXXXX-XXXX-XXX-XXX-XXXXX" % i)

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        titles = xxs.register_namespace("d", "http://www.do-it.org.uk/volunteering-opportunity")
        items = []
        for titles in titles:
            item = DoIt()
            item["url"] = response.url
            item["name"] = titles.select("//d:title").extract()
            item["description"] = titles.select("//d:description").extract()
            item["username"] = titles.select("//d:info-provider/name").extract()
            item["location"] = titles.select("//d:info-provider/address").extract()
            items.append(item)
        return items
Your XML file is using the namespace "http://www.do-it.org.uk/volunteering-opportunity", so to select title, name, etc. you have two choices:
either use xxs.remove_namespaces() once and then use .select("./title"), .select("./description"), etc.,
or register the namespace once, with a prefix like "doit": xxs.register_namespace("doit", "http://www.do-it.org.uk/volunteering-opportunity"), and then use .select("./doit:title"), .select("./doit:description"), etc.
For more details on XML namespaces, see this page in the FAQ and this page in the docs.
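A minimal sketch of the second option inside the question's parse method, registering the "doit" prefix once and qualifying every step of the path with it (the DoIt item and field names come from the question; the exact element paths are assumptions about the feed's structure):

def parse(self, response):
    xxs = XmlXPathSelector(response)
    # Register the namespace once, then use the prefix in every XPath step.
    xxs.register_namespace("doit", "http://www.do-it.org.uk/volunteering-opportunity")
    item = DoIt()
    item["url"] = response.url
    item["name"] = xxs.select("//doit:title/text()").extract()
    item["description"] = xxs.select("//doit:description/text()").extract()
    item["username"] = xxs.select("//doit:info-provider/doit:name/text()").extract()
    return item

# Alternatively (option 1), call xxs.remove_namespaces() once and then select
# with plain names, e.g. xxs.select("//title/text()").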
