I'm trying to get Scrapy to extract the author, date, and post from the forum https://bitcointalk.org/index.php?topic=1209137.0, and import it into my items.
My desired results are (with extraneous HTML that I'll clean up later):
author 1, date 1, post 1
author 2, date 2, post 2
But instead I get:
author 1,2,3,4 date 1,2,3,4, post 1,2,3,4
I've searched around and read a few things about changing XPaths from absolute to relative, but I can't seem to get it working properly. I'm unsure whether that's the root cause, or whether I need to create a pipeline to transform the data.
UPDATE: code attached
class Bitorg(scrapy.Spider):
    name = "bitorg"
    allowed_domains = ["bitcointalk.org"]
    start_urls = [
        "https://bitcointalk.org/index.php?topic=1209137.0"
    ]

    def parse(self, response):
        for sel in response.xpath('..//html/body'):
            item = BitorgItem()
            item['author'] = sel.xpath('.//b/a[@title]').extract()
            item['date'] = sel.xpath('.//td[@valign="middle"]/div[@class="smalltext"]').extract()
            item['post'] = sel.xpath('.//div[@class="post"]').extract()
            yield item
While the <table>, <tbody> and <tr> elements don't have attributes that can easily be selected, there is a <td> for each post with a class of poster_info.
To get a list of all posts, select on that <td> and then move up the tree using the XPath .. notation.
posts = response.xpath('//*[@class="poster_info"]/..')
Within each post, select the child elements of interest.
for post in posts:
    author = ''.join(post.xpath('.//td[@class="poster_info"]//b/a//text()').extract())
    title = ''.join(post.xpath('.//div[@class="subject"]//a//text()').extract())
    date = ''.join(post.xpath('.//div[@class="subject"]/following-sibling::div//text()').extract())
    print('%s, %s, %s' % (author, title, date))
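Wired back into a spider that yields items instead of printing, a minimal sketch (reusing the BitorgItem fields from the question; the import path is an assumption about your project layout) could look like this:

import scrapy
from ..items import BitorgItem  # assumes the item class from the question's items.py

class Bitorg(scrapy.Spider):
    name = "bitorg"
    allowed_domains = ["bitcointalk.org"]
    start_urls = ["https://bitcointalk.org/index.php?topic=1209137.0"]

    def parse(self, response):
        # One selector per post, so each yielded item holds a single post's data
        for post in response.xpath('//*[@class="poster_info"]/..'):
            item = BitorgItem()
            item['author'] = ''.join(post.xpath('.//td[@class="poster_info"]//b/a//text()').extract())
            item['date'] = ''.join(post.xpath('.//div[@class="subject"]/following-sibling::div//text()').extract())
            # Keep the post's raw HTML, as the question says it will be cleaned later
            item['post'] = post.xpath('.//div[@class="post"]').extract_first()
            yield item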
Note that the whole page is just one big div with small tables inside,
and the XPaths for the authors are:
/html/body/div[2]/form/table[1]/tbody/tr[1]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a
/html/body/div[2]/form/table[1]/tbody/tr[5]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a
/html/body/div[2]/form/table[1]/tbody/tr[6]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a
You can use an item loader like this so that you can scrape all of them:
l = XPathItemLoader(item=JustDialItem(), response=response)
for i in range(1, 10):
    l.add_xpath('content1', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
    l.add_xpath('content2', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
    l.add_xpath('content3', '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
You can do the same for the date and post; a rough sketch follows.
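As a sketch only (the 'author', 'date', and 'post' field names and the smalltext/post selectors are assumptions borrowed from the question's own XPaths, not tested against the page):

l = XPathItemLoader(item=JustDialItem(), response=response)
for i in range(1, 10):
    # Each post sits in its own tr; reuse the row prefix for all three fields
    row = '//*[@id="bodyarea"]/form/table[1]/tbody/tr[' + str(i) + ']'
    l.add_xpath('author', row + '/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/b/a/text()')
    l.add_xpath('date', row + '//div[@class="smalltext"]//text()')
    l.add_xpath('post', row + '//div[@class="post"]//text()')
item = l.load_item()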
Related
How can I go about parsing data for one variable directly from the start URL, and data for the other variables after following all the hrefs from the start URL?
The web page I want to scrape has a list of articles with "category", "title", "content", "author" and "date" data. In order to scrape the data, I followed all the "href" links on the start URL, which redirect to the full articles, and parsed the data there. However, the "category" data is not always available when an individual article is opened from its "href" on the start URL, so I end up with missing data for some observations. Now I'm trying to scrape just the "category" data directly from the start URL, which has the "category" data for all article listings (no missing data). How should I go about parsing the "category" data, and how should I handle the parsing and callback? The "category" data is circled in red in the image.
class BtcNewsletterSpider(scrapy.Spider):
    name = 'btc_spider'
    allowed_domains = ['www.coindesk.com']
    start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']

    def parse(self, response):
        for link in response.css('.card-title'):
            yield response.follow(link, callback=self.parse_newsletter)

    def parse_newsletter(self, response):
        item = CoindeskItem()
        item['category'] = response.css('.kjyoaM::text').get()
        item['headline'] = response.css('.fPbJUO::text').get()
        item['content_summary'] = response.css('.jPQVef::text').get()
        item['authors'] = response.css('.dWocII a::text').getall()
        item['published_date'] = response.css(
            '.label-with-icon .fUOSEs::text').get()
        yield item
You can use the cb_kwargs argument to pass data from one parse callback to another. To do this, grab the value of the category for the corresponding link to the full article: iterate through any element that encompasses both the category and the link, and pull the information for both out of that element.
Here is an example based on the code you provided; it should work the way you described.
class BtcNewsletterSpider(scrapy.Spider):
    name = 'btc_spider'
    allowed_domains = ['www.coindesk.com']
    start_urls = ['https://www.coindesk.com/tag/bitcoin/1/']

    def parse(self, response):
        for card in response.xpath("//div[contains(@class,'articleTextSection')]"):
            item = CoindeskItem()
            item["category"] = card.xpath(".//a[@class='category']//text()").get()
            link = card.xpath(".//a[@class='card-title']/@href").get()
            yield response.follow(
                link,
                callback=self.parse_newsletter,
                cb_kwargs={"item": item}
            )

    def parse_newsletter(self, response, item):
        item['headline'] = response.css('.fPbJUO::text').get()
        item['content_summary'] = response.css('.jPQVef::text').get()
        item['authors'] = response.css('.dWocII a::text').getall()
        item['published_date'] = response.css(
            '.label-with-icon .fUOSEs::text').get()
        yield item
I've scraped the tabular data I want from a page. Now I want to filter it (keeping only the 'Version' values for the 'Release' channel) using a pipeline:
The web data is available here: https://learn.microsoft.com/en-us/visualstudio/install/visual-studio-build-numbers-and-release-dates?view=vs-2022
from scrapy.exceptions import DropItem

class ScrapytestPipeline:
    def process_item(self, item, spider):
        if item['Channel'] == 'Release':
            return item
        else:
            raise DropItem("Missing specified keywords.")
The problem is that it's returning nothing now.
The spider:
import scrapy
from ..items import ScrapytestItem

class VsCodeSpider(scrapy.Spider):
    name = 'vscode'
    start_urls = [
        'https://learn.microsoft.com/en-us/visualstudio/install/visual-studio-build-numbers-and-release-dates?view=vs-2022'
    ]

    def parse(self, response):
        item = ScrapytestItem()
        products = response.xpath('//table/tbody//tr')
        for i in products:
            item = dict()
            item['Version'] = i.xpath('td[1]//text()').extract()
            item['Channel'] = i.xpath('td[2]//text()').extract()
            item['Releasedate'] = i.xpath('td[3]//text()').extract()
            item['Buildversion'] = i.xpath('td[4]//text()').extract()
            yield item
The items.py file:
import scrapy

class ScrapytestItem(scrapy.Item):
    Version = scrapy.Field()
    Channel = scrapy.Field()
    Releasedate = scrapy.Field()
    Buildversion = scrapy.Field()
How can I filter on the 'Channel' field (to keep only the matching 'Version' values) using the pipeline? Thanks!
What you are doing in the code is packing all these things into a dictionary and yielding the dictionary.
However, what I understand from your question is that you need to filter values by the version or channel field (correct me if I am wrong). I recommend you yield them separately, as shown in the code below, to achieve the desired result.
def parse(self, response):
    item = ScrapytestItem()
    products = response.xpath('//table/tbody//tr')
    for i in products:
        Version = i.xpath('td[1]//text()').extract()
        Channel = i.xpath('td[2]//text()').extract()
        Releasedate = i.xpath('td[3]//text()').extract()
        Buildversion = i.xpath('td[4]//text()').extract()
        yield {"Version": Version, "Channel": Channel, "Releasedate": Releasedate, "Buildversion": Buildversion}
This way you can filter the data using the pipeline. I also recommend looking at item loaders; they will help you create even more sophisticated data pipelines (the Scrapy documentation covers them well).
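For reference, here is a minimal ItemLoader sketch of the same parse method (field names match the items.py above; using the TakeFirst output processor is one common choice, and it also collapses each field to a single string so the pipeline's string comparison works):

import scrapy
from itemloaders.processors import TakeFirst  # recent Scrapy versions ship this via the itemloaders package
from scrapy.loader import ItemLoader

class ScrapytestItem(scrapy.Item):
    Version = scrapy.Field(output_processor=TakeFirst())
    Channel = scrapy.Field(output_processor=TakeFirst())
    Releasedate = scrapy.Field(output_processor=TakeFirst())
    Buildversion = scrapy.Field(output_processor=TakeFirst())

def parse(self, response):
    for row in response.xpath('//table/tbody//tr'):
        # Build each item from a per-row selector so the relative XPaths stay short
        loader = ItemLoader(item=ScrapytestItem(), selector=row)
        loader.add_xpath('Version', 'td[1]//text()')
        loader.add_xpath('Channel', 'td[2]//text()')
        loader.add_xpath('Releasedate', 'td[3]//text()')
        loader.add_xpath('Buildversion', 'td[4]//text()')
        yield loader.load_item()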
The problem is that you are using extract instead of extract_first.
When you use extract, it extracts all the matches and returns a list of strings. If you run this spider, you will see the list brackets [] in the output:
{'Version': ['16.0.1'], 'Channel': ['Preview 1'], 'Releasedate': ['April 9, 2019'], 'Buildversion': ['16.0.28803.156']}
Your pipeline is comparing against a plain string, but each extracted value is a list, so nothing matches and every item gets dropped.
To fix it, change extract to extract_first.
This is a common error, which is why it is now recommended to use the get and getall methods, whose names are more obvious.
TL;DR: use get when you want a string and getall when you want a list, and avoid extract for better readability.
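Applied to the spider from the question, a corrected parse might look like this sketch (using get(), per the advice above):

def parse(self, response):
    for i in response.xpath('//table/tbody//tr'):
        item = ScrapytestItem()
        # get() returns a single string (or None), so the pipeline's
        # item['Channel'] == 'Release' comparison now works
        item['Version'] = i.xpath('td[1]//text()').get()
        item['Channel'] = i.xpath('td[2]//text()').get()
        item['Releasedate'] = i.xpath('td[3]//text()').get()
        item['Buildversion'] = i.xpath('td[4]//text()').get()
        yield item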
I'm using this code. The last two 'Value' lines are there because I was testing to see whether either of them would work; they don't, though.
def parse_again(self, response):
    sel = Selector(response)
    meta = sel.xpath('//div[@class="LWimg"]')
    items = []
    for m in meta:
        item = PageItem()
        item['link'] = response.url
        item['Stake'] = m.select('//div[@class="stakedLW"]/h1/text()').extract()
        item['Value'] = m.select('//p[@class="value"]/text()').extract()
        item['Value'] = m.select('//div[@class="value"]/span/span/text()').extract()
        items.append(item)
    return items
to retrieve data from this HTML source:
<div class="LWimg">
    <div class="stakedLW">
        <span class="title">Stake</span>
        <span class="value">5.00</span>
        <span class="currency"></span>
My items.py looks like this
from scrapy.item import Item, Field

class Page(Item):
    Stake = Field()
    Value = Field()
The problem is that data is not retrieved, i.e. nothing is saved into a .csv in the end.
Any input is welcome.
You are populating the Value field twice, so only the last assignment will take effect. I think the correct way should be:
item['Value'] = response.xpath('//div[@class="stakedLW"]//span[@class="value"]/text()').extract_first()
The other fields are not necessary, just the link one.
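Put together, a corrected callback might look like this sketch (it assumes the item class also defines a link field, as in the question's code):

def parse_again(self, response):
    # One item per LWimg block; extract_first() stores a plain string, not a list
    for m in response.xpath('//div[@class="LWimg"]'):
        item = PageItem()
        item['link'] = response.url
        item['Value'] = m.xpath('.//div[@class="stakedLW"]//span[@class="value"]/text()').extract_first()
        yield item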
So I have scrapy working really well. It's grabbing data out of a page, but the problem I'm running into is that sometimes the page's table order is different. For example, the first page it gets to:
Row name Data
Name 1 data 1
Name 2 data 2
The next page it crawls to might have a completely different order: where Name 1 was the first row, on another page it might be the 3rd or 4th, etc. The row names are always the same. I was thinking of doing this in one of two ways, and I'm not sure which will work or which is better.
First option, use some if statements to find the row I need, and then grab the following column. This seems a little messy but could work.
Second option, grab all the data in the table regardless of order and put it in a dict. This way, I can grab the data I need based on row name. This seems like the cleanest approach.
Is there a 3rd option or a better way of doing either?
Here's my code in case it's helpful.
class pageSpider(Spider):
    name = "pageSpider"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://domain.com/stuffs/results",
    ]
    visitedURLs = Set()

    def parse(self, response):
        products = Selector(response).xpath('//*[@class="itemCell"]')
        for product in products:
            item = PageScraper()
            item['url'] = product.xpath('div[2]/div/a/@href').extract()[0]
            urls = Set([product.xpath('div[2]/div/a/@href').extract()[0]])
            print(urls)
            for url in urls:
                if url not in self.visitedURLs:
                    request = Request(url, callback=self.productpage)
                    request.meta['item'] = item
                    yield request

    def productpage(self, response):
        specs = Selector(response).xpath('//*[@id="Specs"]')
        item = response.meta['item']
        for spec in specs:
            item['make'] = spec.xpath('fieldset[1]/dl[1]/dd/text()').extract()[0].encode('utf-8', 'ignore')
            item['model'] = spec.xpath('fieldset[1]/dl[4]/dd/text()').extract()[0].encode('utf-8', 'ignore')
            item['price'] = spec.xpath('fieldset[2]/dl/dd/text()').extract()[0].encode('utf-8', 'ignore')
            yield item
The XPaths in productpage can return data that doesn't correspond to what I need, because the order has changed.
Edit:
I'm trying the dict approach and I think this is the best option.
def productpage(self, response):
    specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
    itemdict = {}
    for i in specs:
        test = i.xpath('dl')
        for t in test:
            itemdict[t.xpath('dt/text()').extract()[0]] = t.xpath('dd/text()').extract()[0]
    item = response.meta['item']
    item['make'] = itemdict['Brand']
    yield item
This seems like the best and cleanest approach (using dict)
def productpage(self, response):
    specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
    itemdict = {}
    for i in specs:
        test = i.xpath('dl')
        for t in test:
            itemdict[t.xpath('dt/text()').extract()[0]] = t.xpath('dd/text()').extract()[0]
    item = response.meta['item']
    item['make'] = itemdict['Brand']
    yield item
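One small caveat (my assumption, not something from the question): if a page happens to omit the 'Brand' row entirely, itemdict['Brand'] will raise a KeyError and kill that item; a dict .get() lookup avoids that:

# Returns None (or a default you choose) when the 'Brand' row is absent
item['make'] = itemdict.get('Brand')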
I have been wondering what would be the best way to scrape multiple levels of data using Scrapy.
I will describe the situation in four stages:
the current architecture I am following to scrape this data
the basic code structure
the difficulties, and why I think there has to be a better option
the format in which I have tried to store the data, failed, and then partially succeeded
Current architecture
The data structure:
First page: list of artists
Second page: list of albums for each artist
Third page: list of songs for each album
Basic code structure
class MusicLibrary(Spider):
    name = 'MusicLibrary'

    def parse(self, response):
        items = Discography()
        items['artists'] = []
        for artist in artists:
            item = Artist()
            item['albums'] = []
            item['artist_name'] = "name"
            items['artists'].append(item)
            album_page_url = "extract link to album and yield that page"
            yield Request(album_page_url,
                          callback=self.parse_album,
                          meta={'item': items,
                                'artist_name': item['artist_name']})

    def parse_album(self, response):
        base_item = response.meta['item']
        artist_name = response.meta['artist_name']
        # search for the artist added in the previous method and append the album under that artist
        artist_index = self.get_artist_index(base_item['artists'], artist_name)
        albums = "some path selector"
        for album in albums:
            item = Album()
            item['songs'] = []
            item['album_name'] = "name"
            base_item['artists'][artist_index]['albums'].append(item)
            song_page_url = "extract link to song and yield that page"
            yield Request(song_page_url,
                          callback=self.parse_song_name,
                          meta={'item': base_item,
                                'key': item['album_name'],
                                'artist_index': artist_index})

    def parse_song_name(self, response):
        base_item = response.meta['item']
        album_name = response.meta['key']
        artist_index = response.meta["artist_index"]
        album_index = self.search(base_item['artists'][artist_index]['albums'], album_name)
        songs = "some path selector"
        for song in songs:
            item = Song()
            song_name = "song name"
            base_item['artists'][artist_index]['albums'][album_index]['songs'].append(item)
            # total_count (total songs to parse) = the main artist page lists the total number of songs per artist
            # current_count (currently parsed) = walk each artist->album->songs list and count the length
            # yield base_item only when the songs-to-scrape count matches the songs-scraped count
            if current_count == total_count:
                yield base_item
The difficulties, and why I think there has to be a better option
Currently I am yielding the item object only when all the pages and sub-pages have been scraped, with the condition that the count of songs to scrape matches the count of songs scraped.
But given the nature and volume of the scraping, some pages return a status code other than 200 (OK); those songs will not be scraped and the item counts will not match.
So at the end, even though 90% of the pages were scraped successfully, the counts will not match, nothing will be yielded, and all that CPU power will be wasted.
The format in which I have tried to store the data, failed, and then partially succeeded
I wanted the data for each item object in a single-line format,
i.e. artist name - album name - song name,
so if artist A has one album (aa) with 8 songs, 8 items will be stored, one entry (item) per song.
But with the current format, when I tried yielding every time in the last function, parse_song_name, it yielded that complex nested structure every time, and the object grew incrementally each time.
Then I thought that appending everything into Discography->artists, then Artist->albums, then Album->songs was the problem, but when I removed the appending and tried without it, I only ever yielded one object, the last one, not all of them.
So finally I developed the workaround described above, but it does not work every time (in the case of non-200 status codes),
and when it does work, after yielding, I have a pipeline where I parse this JSON again and store it in the format I initially wanted (one line for each song, a flat structure).
Can anyone suggest what I am doing wrong here, or how I can make this more efficient and make it work when some of the pages return a non-200 code?
The problem with the code above was:
Mutable objects (list, dict): all the callbacks were changing that same object in each loop, so the first and second levels of data were being overwritten in the last, third loop (mp3_son_url). (This was my failed attempt.)
The solution was to use copy.deepcopy to create a new object from the response.meta object in the callback method, rather than changing the base_item object.
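A minimal sketch of that change, applied to the parse_album callback above (the rest of the method stays the same, and parse_song_name gets the same treatment):

import copy

def parse_album(self, response):
    # Work on a private deep copy so parallel request chains don't
    # overwrite each other's artist/album data through the shared meta object
    base_item = copy.deepcopy(response.meta['item'])
    # ... the rest of parse_album is unchanged; every Request yielded below
    # passes this copied base_item in its meta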
I will try to explain the full answer when I get some time.