I've been working on something that I think should be relatively easy, but I keep hitting my head against a wall. I've tried multiple similar solutions from Stack Overflow and improved my code, but I'm still stuck on the basic functionality.
I am scraping a web page that returns an element (genre) that is essentially a list of genres:
Mystery, Comedy, Horror, Drama
The xpath returns perfectly. I'm using a Scrapy pipeline to output to a CSV file. What I'd like to do is create a separate row for each item in the above list along with the page url:
"Mystery", "http://domain.com/page1.html"
"Comedy", "http://domain.com/page1.html"
No matter what I try I can only output:
"Mystery, Comedy, Horror, Drama", "http://domain.com/page1.html"
Here's my code:
def parse_genre(self, response):
    for item in [i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]:
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', item, MapCompose(str.strip))
        yield sg.load_item()
This is called from the main parse routine for the spider. That all functions correctly. (I have two items on each web page. The main spider gathers the "parent" information and this function is attempting to gather "child" information. Technically not a child record, but definitely a 1 to many relationship.)
I've tried a number of possible solutions. This is the only version that makes sense to me and seems like it should work. I'm sure I'm just not splitting the genre string correctly.
You are very close.
Your culprit seems to be the way you are getting your items:
[i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]
Without having the page source I can't correct you fully, but it is clear that your code is returning a list of lists.
You should either flatten this list of lists into a list of strings or iterate through it appropriately:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
for item in items:
    for category in item.split(','):
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', category, MapCompose(str.strip))
        yield sg.load_item()
An alternative, more advanced technique would be a nested list comprehension:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
# good cheatsheet to remember this: [leaf for tree in forest for leaf in tree]
categories = [cat for item in items for cat in item.split(',')]
for category in categories:
    sg = ItemLoader(item=ItemGenre(), response=response)
    sg.add_value('url', response.url)
    sg.add_value('genre', category, MapCompose(str.strip))
    yield sg.load_item()
Related
How can we export multiple item types in Scrapinghub?
I have MainItem() and SubItem() item classes, and I would like to see both kinds of items on Scrapinghub's items page. I can do this by implementing different item pipelines for the two items in a normal crawl, but how can this be achieved in Scrapinghub? As of now, I'm getting only MainItem() objects on the items page.
A sample code snippet is given below:
def parse_data(self, response):
    # ...
    # main item fields
    m_item = MainItem()
    m_item['a'] = 'A'
    m_item['b'] = 'B'
    yield m_item
    # sub item fields
    s_item = SubItem()
    s_item['c'] = 'C'
    s_item['d'] = 'D'
    yield s_item
Here in Scrapinghub I'm able to view only the MainItem() fields.
Can you provide more information, such as the spider code and logs? I can't see any problem with your example.
Scrapy Cloud does allow a spider to yield different item types. These items can be filtered later using the Scrapy Cloud interface.
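To illustrate the pattern, here is a minimal sketch of one callback yielding two different item types. MainItem and SubItem are stand-ins here (plain dict subclasses rather than scrapy.Item) so the shape of the code is visible outside a running spider:

```python
# Stand-in item classes (plain dict subclasses instead of scrapy.Item),
# just to show one callback yielding two different item types.
class MainItem(dict):
    pass

class SubItem(dict):
    pass

def parse_data(response=None):
    yield MainItem(a='A', b='B')   # first item type
    yield SubItem(c='C', d='D')    # second item type

items = list(parse_data())
# Scrapy Cloud keys items by their class name, so both types show up
# and can be filtered separately on the items page.
```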
import scrapy

class ScrapeMovies(scrapy.Spider):
    name = 'conference-papers'
    start_urls = [
        'http://archive.bridgesmathart.org/2015/index.html'
    ]

    def parse(self, response):
        for entry in response.xpath('//div[@class="col-md-9"]'):
            yield {
                'type': entry.xpath('.//div[@class="h4 alert alert-info"]/text()').extract(),
                'title': entry.xpath('.//span[@class="title"]/text()').extract(),
                'authors': entry.xpath('.//span[@class="authors"]/text()').extract()
            }
With the code above I want to scrape the type, title, and authors of every publication listed. However, when I run it I get the types on one line, the titles separated by newlines, and the authors at the end on one line.
How do I join those three values together per publication? What is the best approach to deal with this?
Here is an excerpt from the HTML code I want to scrape:
BTW: if you downvote, please explain why. I am just curious.
You need to concatenate your values like this: https://stackoverflow.com/a/19418858/6668185
Then you need to get the preceding div for each entry and take its value, which would be something like this: https://stackoverflow.com/a/9857809/6668185
I will improve on this answer with the exact solution in a sec.
UPDATE/IMPROVEMENT
Try this:
'type': entry.xpath('.//span[@class="title"]/preceding-sibling::div[@class="h4 alert alert-info"]/text()').extract()
I didn't test it, but I think it should work just fine.
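If the three extracted lists come back aligned (one entry per publication), another way to pair the values is to zip them after extraction. A plain-Python sketch with made-up data (the list contents here are assumptions, not output from the real page):

```python
# Hypothetical extracted lists, shaped the way the question describes them:
types = ['invited papers', 'invited papers']
titles = ['Paper A', 'Paper B']
authors = ['Alice', 'Bob']

# One record per publication, joining the three parallel lists:
records = [
    {'type': t, 'title': title, 'authors': a}
    for t, title, a in zip(types, titles, authors)
]
```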
Set-up
I'm scraping housing ads with scrapy: per housing ad I scrape several housing characteristics.
Scraping the housing characteristics works fine.
Problem
Besides the housing characteristics, I want to scrape one image per ad.
I have the following code:
class ApartmentSpider(scrapy.Spider):
    name = 'apartments'
    start_urls = [
        'http://www.jaap.nl/huurhuizen/noord+holland/groot-amsterdam/amsterdam'
    ]

    def parse(self, response):
        for href in response.xpath(
            '//*[@id]/a',
        ).css("a.property-inner::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_ad)       # parse_ad() scrapes housing characteristics
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_AdImage)  # parse_AdImage() obtains one image per ad
So I've got two yield statements, which does not work; that is, I get the characteristics but not the images.
If I comment out the first one, I get the images instead.
How do I fix this so that I get both? Thanks in advance.
Just yield them both together.
yield (scrapy.Request(response.urljoin(href), callback=self.parse_ad),
       scrapy.Request(response.urljoin(href), callback=self.parse_AdImage))
On the receiving end, grab both as separate values
characteristics, image = ApartmentSpider.parse(response)
I have two major suggestions:
Number 1
I would strongly suggest reworking your code to actually gather all the info at the same time. Instead of having two separate parse_X functions, just have one that gets the info and returns a single item.
Number 2
Implement a spider middleware that does the merging/splitting, similar to what I have below for pipelines. A simple example middleware is https://github.com/scrapy/scrapy/blob/ebef6d7c6dd8922210db8a4a44f48fe27ee0cd16/scrapy/spidermiddlewares/urllength.py. You would simply merge items and track them here before they enter the item pipelines.
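A rough sketch of that merging idea, with the Scrapy plumbing stripped out so only the buffering logic shows. The field names (ad_id, chars, image) are assumptions for illustration:

```python
class MergePartialItemsMiddleware:
    """Buffers partial items by ad id and emits one merged item
    once both halves (characteristics and image) have arrived."""

    def __init__(self):
        self.partials = {}

    def process_spider_output(self, response, result, spider=None):
        for obj in result:
            if not isinstance(obj, dict) or 'ad_id' not in obj:
                yield obj  # requests / unrelated items pass through untouched
                continue
            buffered = self.partials.setdefault(obj['ad_id'], {})
            buffered.update(obj)
            if 'chars' in buffered and 'image' in buffered:
                # Both halves present: emit the merged item once.
                yield self.partials.pop(obj['ad_id'])

mw = MergePartialItemsMiddleware()
out = list(mw.process_spider_output(None, [
    {'ad_id': 1, 'chars': {'rooms': 3}},
    {'ad_id': 1, 'image': 'img.jpg'},
]))
# out now holds a single merged item for ad 1
```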
WARNING: DO NOT DO WHAT'S BELOW. I was going to suggest this, and the code might work, but with some potentially hidden issues.
It is here for completeness of what I was researching; it is recommended against here: https://github.com/scrapy/scrapy/issues/1915
Use the item-processing pipelines in Scrapy. They are incredibly useful for accumulating data. Have an item-joiner pipeline whose purpose is to wait for the two separate partial data items, concatenate them into one item, and key them on the ad id (or some other unique piece of data).
In rough, not-runnable pseudocode:
class HousingItemPipeline(object):
    def __init__(self):
        self.assembledItems = dict()

    def process_item(self, item, spider):
        if isinstance(item, PartialAdHousingItem):
            self.assembledItems[unique_id] = AssembledHousingItem()
            self.assembledItems[unique_id]['field_of_interest'] = ...
            # ...assemble more data
            raise DropItem("Assembled its data")
        if isinstance(item, PartialAdImageHousingItem):
            self.assembledItems[unique_id]['field_of_interest'] = ...
            # ...assemble more data
            raise DropItem("Assembled its data")
        if fully_assembled:
            return self.assembledItems.pop(unique_id)
I have recently started using Scrapy and am trying to clean some data I have scraped so I can export it to CSV, namely in the following three examples:
Example 1 – removing certain text
Example 2 – removing/replacing unwanted characters
Example 3 – splitting comma-separated text
Example 1 data looks like:
Text I want,Text I don’t want
Using the following code:
'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract()
Example 2 data looks like:
 - but I want to change this to £
Using the following code:
' Scraped 2': response.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract()
Example 3 data looks like:
Item 1,Item 2,Item 3,Item 4,Item 4,Item5 – ultimately I want to split this into separate columns in a CSV file
Using the following code:
' Scraped 3': response.xpath('//div/div/div/ul/li/p/text()').extract()
I have tried using str.replace(), but can't seem to get it to work, e.g.:
'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract((str.replace(",Text I don't want",""))
I am still looking into this, but I would appreciate it if anyone could point me in the right direction!
Code below:
import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import Product

class QuotesSpider(scrapy.Spider):
    name = "quotes_product"
    start_urls = [
        'http://www.unitestudents.com/',
    ]

    # Step 1
    def parse(self, response):
        # Select all cities listed in the select (exclude the "Select your city" option)
        for city in response.xpath('//select[@id="frm_homeSelect_city"]/option[not(contains(text(),"Select your city"))]/text()').extract():
            yield scrapy.Request(response.urljoin("/" + city), callback=self.parse_citypage)

    # Step 2
    def parse_citypage(self, response):
        # Select the url for each property
        for url in response.xpath('//div[@class="property-header"]/h3/span/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)

    # Step 3
    def parse_unitpage(self, response):
        # Select the final page for the data scrape
        for final in response.xpath('//div/div/div[@class="content__btn"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(final), callback=self.parse_final)

    # Step 4
    def parse_final(self, response):
        unitTypes = response.xpath('//html/body/div').extract()
        for unitType in unitTypes:  # There can be multiple unit types, so we yield an item for each one we find.
            l = ItemLoader(item=Product(), response=response)
            l.add_xpath('area_name', '//div/ul/li/a/span/text()')
            l.add_xpath('type', '//div/div/div/h1/span/text()')
            l.add_xpath('period', '/html/body/div/div/section/div/form/h4/span/text()')
            l.add_xpath('duration_weekly', '//html/body/div/div/section/div/form/div/div/em/text()')
            l.add_xpath('guide_total', '//html/body/div/div/section/div/form/div/div/p/text()')
            l.add_xpath('amenities', '//div/div/div/ul/li/p/text()')
            return l.load_item()
However, I'm getting the following error:
value = self.item.fields[field_name].get(key, default)
KeyError: 'type'
You have the right idea with str.replace, although I would suggest Python's re regular-expression library, as it is more powerful. The documentation is top-notch and you can find some useful code samples there.
I am not familiar with the Scrapy library, but it looks like .extract() returns a list of strings. If you want to transform these using str.replace or one of the regex functions, you will need a list comprehension:
'Selector 1': [ x.replace('A', 'B') for x in response.xpath('...').extract() ]
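For instance, here is the same cleanup done with re.sub. The pattern is an assumption based on the "Text I want,Text I don't want" example; the real unwanted suffix may differ:

```python
import re

# Extracted strings, as the question's Example 1 describes them:
scraped = ["Text I want,Text I don't want"]

# Strip the unwanted trailing text from every extracted string.
# The '.' in the pattern tolerates either a straight or curly apostrophe.
cleaned = [re.sub(r",Text I don.t want$", "", s) for s in scraped]
```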
Edit: regarding the separate columns, if the data is already comma-separated, just write it directly to a file! If you want to split the comma-separated data to do some transformations, you can use str.split like this:
"A,B,C".split(",") # returns [ "A", "B", "C" ]
In this case, the data returned from .extract() will be a list of comma-separated strings. If you use a list comprehension as above, you will end up with a list-of-lists.
If you want something more sophisticated than splitting on each comma, you can use python's csv library.
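For example, the csv module copes with quoted fields that a plain str.split(',') would break apart (the sample row here is made up):

```python
import csv
import io

# A row where one field itself contains a comma, protected by quotes:
row = 'Item 1,Item 2,"Item 3, with an embedded comma"'

# csv.reader respects the quoting, unlike str.split(','):
fields = next(csv.reader(io.StringIO(row)))
```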
It would be much easier to provide a more specific answer if you had provided your spider and item definitions. Here are some generic guidelines.
If you want to keep things modular and follow Scrapy's suggested project architecture and separation of concerns, you should clean and prepare your data for further export via Item Loaders with input and output processors.
For the first two examples, MapCompose looks like a good fit.
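Conceptually, MapCompose applies a chain of functions to each extracted value in turn. A simplified stand-in shows the idea (the real processor in the itemloaders package also flattens iterables and handles loader contexts):

```python
def map_compose(*funcs):
    # Simplified stand-in for itemloaders' MapCompose: run each function
    # over every value in turn, dropping values that become None.
    def process(values):
        for func in funcs:
            next_values = []
            for value in values:
                result = func(value)
                if result is not None:
                    next_values.append(result)
            values = next_values
        return values
    return process

# Chain two cleanups: strip whitespace, then title-case each value.
clean = map_compose(str.strip, str.title)
```

Used as an input processor on a field, this runs on every extracted string before the item is loaded, which is exactly where the question's cleanup belongs.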
I have to crawl the following URL, which basically contains reviews. Every review there has a writer name, a title, and the review text itself.
I've chosen Python with Scrapy to do this task.
But the URL mentioned is not the start URL; I obtain it from the basic parse method. In parse I initialize an ItemLoader, extract a few fields, and pass the items via the meta of the response. (The extracted fields contain information such as hotel name, address, pricing, etc.)
I have also declared the items review_member_name, review_quote, and review_review.
There is more than one review on the page, and the review id for a review can be obtained from response.url (shown in the parse_review method below).
Since there are many reviews and all share the same item fields, the values get concatenated, which should not happen. Can anybody suggest a way to solve this?
Below is my source for parse_review:
def parse_review(self, response):
    review_nos = re.search(".*www\.tripadvisor\.in/ExpandedUserReviews-.*context=1&reviews=(.+)&servlet=Hotel_Review&expand=1", response.url).group(1)
    review_nos = review_nos.split(',')  # list of review ids
    for review_no in review_nos:
        item = response.meta['item']
        # item = ItemLoader(item=TripadvisorItem(), response=response) - this works fine but I will lose the items from the parse method
        div_id = "expanded_review_" + review_no
        review = response.xpath('/html/body/div[@id="%s"]' % div_id)
        member_name = review.xpath('.//div[@class="member_info"]//div[@class="username mo"]//text()').extract()
        if member_name:
            item.add_value('review_member_name', member_name)
        review_quote = review.xpath('.//div[@class="innerBubble"]/div[@class="quote"]//text()').extract()
        if review_quote:
            item.add_value('review_quote', review_quote)
        review_entry = review.xpath('.//div[@class="innerBubble"]/div[@class="entry"]//text()').extract()
        if review_entry:
            item.add_value('review_review', review_entry)
        yield item.load_item()
The following is my items.json ("review_review" is being removed, and the items from the parse method are removed too):
[{"review_quote": "\u201c Fabulous service \u201d", "review_member_name": "VimalPrakash"},
{"review_quote": "\u201c Fabulous service \u201d \u201c Indian hospitality at its best, and honestly the best coffee in India \u201d", "review_member_name": "VimalPrakash Jessica P"},
{"review_quote": "\u201c Fabulous service \u201d \u201c Indian hospitality at its best, and honestly the best coffee in India \u201d \u201c Nice hotel in a central location \u201d", "review_member_name": "VimalPrakash Jessica P VikInd"}]
And please suggest a good title for this question.
You'll have to create a new ItemLoader before doing add_value on it; right now you're creating one item and adding new values to it again and again in the loop.
for review_no in review_nos:
    item = ItemLoader(item=projectItem(), response=response)
    ...
    yield item.load_item()
You can also use .add_xpath directly with the XPath you're supplying, and use response.xpath as the selector for the item when creating the ItemLoader. That way you can probably avoid all the if tests and let the loader do what it should do: load items.
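Stripped of the Scrapy specifics, the core of the fix is that the container is created inside the loop, so each review gets its own item instead of accumulating into one. A plain-Python sketch with made-up review data:

```python
# Hypothetical review data standing in for the parsed page:
reviews = [
    {'name': 'VimalPrakash', 'quote': 'Fabulous service'},
    {'name': 'Jessica P', 'quote': 'Indian hospitality at its best'},
]

items = []
for review in reviews:
    item = {}  # fresh item per review: created inside the loop, not before it
    item['review_member_name'] = review['name']
    item['review_quote'] = review['quote']
    items.append(item)
# Each dict now holds exactly one review, with no concatenation across iterations.
```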