Scraping with Scrapy - merge fields - Python

import scrapy

class ScrapeMovies(scrapy.Spider):
    name = 'conference-papers'
    start_urls = [
        'http://archive.bridgesmathart.org/2015/index.html'
    ]

    def parse(self, response):
        for entry in response.xpath('//div[@class="col-md-9"]'):
            yield {
                'type': entry.xpath('.//div[@class="h4 alert alert-info"]/text()').extract(),
                'title': entry.xpath('.//span[@class="title"]/text()').extract(),
                'authors': entry.xpath('.//span[@class="authors"]/text()').extract(),
            }
With the following code I want to scrape the type, title and authors of every single publication listed. However, when I run it I get all the types on one line, the titles separated by newlines, and all the authors at the end on one line.
How do I join those three values together per publication? What is the best approach to deal with this?
Here is an excerpt from the HTML code I want to scrape:
BTW: if you downvote, please explain why. I am just curious.

You need to concatenate your values like this: https://stackoverflow.com/a/19418858/6668185
Then you need to get the preceding div for each entry and take its value, which would be something like this: https://stackoverflow.com/a/9857809/6668185
I will improve this answer with the exact solution in a second.
UPDATE/IMPROVEMENT
Try this:
'type': entry.xpath('.//span[@class="title"]/preceding-sibling::div[@class="h4 alert alert-info"]/text()').extract()
I didn't test it, but I think it should work just fine.
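Building on that, here is a minimal untested sketch of the whole spider. Iterating per title span and reaching the authors via a sibling axis are my assumptions about the page structure:

import scrapy

class ConferencePapersSpider(scrapy.Spider):
    name = 'conference-papers'
    start_urls = ['http://archive.bridgesmathart.org/2015/index.html']

    def parse(self, response):
        # One iteration per title span, so each yielded dict is one publication.
        for paper in response.xpath('//div[@class="col-md-9"]//span[@class="title"]'):
            yield {
                # Nearest section header before this title; [1] picks the
                # closest match on the reverse preceding axis.
                'type': paper.xpath('preceding::div[@class="h4 alert alert-info"][1]/text()').extract_first(),
                'title': paper.xpath('text()').extract_first(),
                # Assumption: the authors span follows the title span as a sibling.
                'authors': paper.xpath('following-sibling::span[@class="authors"][1]/text()').extract_first(),
            }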

Related

Using scrapy to extract and structure table data

I'm new to Python and Scrapy and thought I'd try out a simple review site to scrape. While most of the site structure is straightforward, I'm having trouble extracting the content of the reviews. This portion is visually laid out in sets of 3 (the text to the right of the 良 (good), 悪 (bad) and 感 (impressions) fields), but I'm having trouble pulling this content and associating it with a reviewer or section of a review due to the use of generic divs, (br) tags, \n and other formatting.
Any help would be appreciated.
Here's the site and code I've tried for the grabbing them, with some results.
http://www.psmk2.net/ps2/soft_06/rpg/p3_log1.html
(1):
response.xpath('//tr//td[@valign="top"]//text()').getall()
This returns the entire set of reviews, but it contains newline markup and, more of a problem, it renders each line as a separate entry. Due to this, I can't figure out where the good, bad, and impression portions end, nor can I easily parse each separate review as entry length varies.
['\n弱点をついた時のメリット、つかれたときのデメリットがはっきりしてて良い', '\nコミュをあげるのが楽しい',
'\n仲間が多くて誰を連れてくか迷う', '\n難易度はやさしめなので遊びやすい', '\nタルタロスしかダンジョンが無くて飽きる。'........and so forth
(2) As an alternative, I tried:
response.xpath('//tr//td[@valign="top"]')[0].get()
Which actually comes close to what I'd like, save for the markup. Here it seems that it returns the entire field of a review section. Every third element should be the "good" points of each separate review (I've replaced the <> with () to show the raw return).
(td valign="top")\n精一杯考えました(br)\n(br)\n戦闘が面白いですね\n主人公だけですが・・・・(br)\n従来のプレスターンバトルの進化なので(br)\n(br)\n以上です(/td)
(3) Figuring I might be able to get just the text, I then tried:
response.xpath('//tr//td[@valign="top"]//text()')[0].get()
But that only provides each line at a time, with the \n at the front. As with (1), a line by line rendering makes it difficult to attribute reviews to reviewers and the appropriate section in their review.
Of these, (2) seems the closest to what I want, and I was hoping I could get some direction on how to grab each section of each review without the markup. I was thinking that since these sections come in sets of 3, putting them in a list would make pulling them out easier later (i.e. all "good" sections at indexes 0, 0+3, ...; all "bad" ones at 1, 1+3, ... etc.)...but first I need to actually get the elements.
I've thought about, and tried, iterating over each line with an "if" conditional (something like:)
i = 0
if i <= len(response.xpath('//tr//td[@valign="top"]//text()').getall()):
    yield {response.xpath('//tr//td[@valign="top"]')[i].get()}
    i + 1
to pull these out, but I'm a bit lost on how to implement something like this and not sure where it should go. I've briefly looked at Item Loaders, but as I'm new to this, I'm still trying to figure them out.
Here's the block where the review code is.
def parse(self, response):
    for table in response.xpath('body'):
        yield {
            # code for other elements in the review
            'date': response.xpath('//td//div[@align="left"]//text()').getall(),
            'name': response.xpath('//td//div[@align="right"]//text()').getall(),
            # this includes the above elements, and is regular enough that I can systematically extract what I want
            'categories': response.xpath('//tr//td[@class="koumoku"]//text()').getall(),
            'scores': response.xpath('//tr//td[@class="tokuten_k"]//text()').getall(),
            'play_time': response.xpath('//td[@align="right"]//span[@id="setumei"]//text()').getall(),
            # reviews code here
        }
This is a pretty simple task if you use part of the text as an anchor (I used string() to get the text content of a whole td):
for review_node in response.xpath('//table[@width="645"]'):
    good = review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get()
    bad = review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get()
    ...
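Filling in the elided part, a complete parse method along these lines might look as follows. This is an untested sketch; extending the same pattern to the 感 (impressions) label and the choice of item keys are my assumptions:

def parse(self, response):
    # Each review appears to live in its own fixed-width table.
    for review_node in response.xpath('//table[@width="645"]'):
        yield {
            # string() concatenates every text node inside the matched td,
            # so the <br>-separated lines come back as one string.
            'good': review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get(),
            'bad': review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get(),
            'impressions': review_node.xpath('string(.//td[b[starts-with(., "感")]]/following-sibling::td[1])').get(),
        }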

Splitting a Scrapy element among multiple CSV rows

I've been working on something that I think should be relatively easy, but I keep hitting my head against a wall. I've tried multiple similar solutions from Stack Overflow and have improved my code, but I'm still stuck on the basic functionality.
I am scraping a web page that returns an element (genre) that is essentially a list of genres:
Mystery, Comedy, Horror, Drama
The xpath returns perfectly. I'm using a Scrapy pipeline to output to a CSV file. What I'd like to do is create a separate row for each item in the above list along with the page url:
"Mystery", "http:domain.com/page1.html"
"Comedy", "http:domain.com/page1.html"
No matter what I try I can only output:
"Mystery, Comedy, Horror, Drama", ""http:domain.com/page1.html"
Here's my code:
def parse_genre(self, response):
    for item in [i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]:
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', item, MapCompose(str.strip))
        yield sg.load_item()
This is called from the main parse routine for the spider, and that all functions correctly. (I have two items on each web page. The main spider gathers the "parent" information and this function is attempting to gather the "child" information. Technically not a child record, but definitely a one-to-many relationship.)
I've tried a number of possible solutions. This is the only version that makes sense to me and seems like it should work. I'm sure I'm just not splitting the genre string correctly.
You are very close.
Your culprit seems to be the way you are getting your items:
[i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]
Without having the source I can't correct you fully, but it is obvious that your code is returning a list of lists.
You should either flatten this list of lists into a list of strings or iterate through it appropriately:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
for item in items:
    for category in item.split(','):
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', category, MapCompose(str.strip))
        yield sg.load_item()
An alternative, more advanced technique would be to use a nested list comprehension:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
# good cheatsheet to remember this: [leaf for tree in forest for leaf in tree]
categories = [cat for item in items for cat in item.split(',')]
for category in categories:
    sg = ItemLoader(item=ItemGenre(), response=response)
    sg.add_value('url', response.url)
    sg.add_value('genre', category, MapCompose(str.strip))
    yield sg.load_item()
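To illustrate the flattening step on its own, here is a tiny standalone example with made-up data standing in for the extract() result:

# Hypothetical stand-in for response.xpath(...).extract()
items = ['Mystery, Comedy', 'Horror, Drama']

# [leaf for tree in forest for leaf in tree], splitting each string on commas
categories = [cat.strip() for item in items for cat in item.split(',')]
print(categories)  # ['Mystery', 'Comedy', 'Horror', 'Drama']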

Cleaning data scraped using Scrapy

I have recently started using Scrapy and am trying to clean some data I have scraped and want to export to CSV, namely the following three examples:
Example 1 – removing certain text
Example 2 – removing/replacing unwanted characters
Example 3 – splitting comma-separated text
Example 1 data looks like:
Text I want,Text I don’t want
Using the following code:
'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract()
Example 2 data looks like:
 - but I want to change this to £
Using the following code:
'Scraped 2': response.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract()
Example 3 data looks like:
Item 1,Item 2,Item 3,Item 4,Item 4,Item5 – ultimately I want to split this into separate columns in a CSV file
Using the following code:
'Scraped 3': response.xpath('//div/div/div/ul/li/p/text()').extract()
I have tried using str.replace(), but can't seem to get it to work, e.g.:
'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract(str.replace(",Text I don't want",""))
I am looking into this, but would appreciate it if anyone could point me in the right direction!
Code below:
import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import Product

class QuotesSpider(scrapy.Spider):
    name = "quotes_product"
    start_urls = [
        'http://www.unitestudents.com/',
    ]

    # Step 1
    def parse(self, response):
        # Select all cities listed in the dropdown (excluding the "Select your city" option)
        for city in response.xpath('//select[@id="frm_homeSelect_city"]/option[not(contains(text(),"Select your city"))]/text()').extract():
            yield scrapy.Request(response.urljoin("/" + city), callback=self.parse_citypage)

    # Step 2
    def parse_citypage(self, response):
        # Select the url for each property
        for url in response.xpath('//div[@class="property-header"]/h3/span/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)

    # Step 3
    def parse_unitpage(self, response):
        # Select the final page for the data scrape
        for final in response.xpath('//div/div/div[@class="content__btn"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(final), callback=self.parse_final)

    # Step 4
    def parse_final(self, response):
        unitTypes = response.xpath('//html/body/div').extract()
        # There can be multiple unit types, so we yield an item for each unit type we can find.
        for unitType in unitTypes:
            l = ItemLoader(item=Product(), response=response)
            l.add_xpath('area_name', '//div/ul/li/a/span/text()')
            l.add_xpath('type', '//div/div/div/h1/span/text()')
            l.add_xpath('period', '/html/body/div/div/section/div/form/h4/span/text()')
            l.add_xpath('duration_weekly', '//html/body/div/div/section/div/form/div/div/em/text()')
            l.add_xpath('guide_total', '//html/body/div/div/section/div/form/div/div/p/text()')
            l.add_xpath('amenities', '//div/div/div/ul/li/p/text()')
            return l.load_item()
However, I'm getting the following error:
value = self.item.fields[field_name].get(key, default)
KeyError: 'type'
You have the right idea with str.replace, although I would suggest Python's 're' regular expressions library, as it is more powerful. The documentation is top notch and you can find some useful code samples there.
I am not familiar with the scrapy library, but it looks like .extract() returns a list of strings. If you want to transform these using str.replace or one of the regex functions, you will need to use a list comprehension:
'Selector 1': [ x.replace('A', 'B') for x in response.xpath('...').extract() ]
Edit: Regarding the separate columns-- if the data is already comma-separated just write it directly to a file! If you want to split the comma-separated data to do some transformations, you can use str.split like this:
"A,B,C".split(",") # returns [ "A", "B", "C" ]
In this case, the data returned from .extract() will be a list of comma-separated strings. If you use a list comprehension as above, you will end up with a list-of-lists.
If you want something more sophisticated than splitting on each comma, you can use python's csv library.
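As a small sketch of that csv approach (my own toy data, not from the question):

import csv

row = 'Item 1,"Item 2, with a comma",Item 3'
# csv.reader takes any iterable of lines, so a one-element list works;
# unlike str.split(','), it respects quoted fields
parsed = next(csv.reader([row]))
print(parsed)  # ['Item 1', 'Item 2, with a comma', 'Item 3']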
It would be much easier to provide a more specific answer if you had provided your spider and item definitions. Here are some generic guidelines.
If you want to keep things modular and follow Scrapy's suggested project architecture and separation of concerns, you should clean and prepare your data for further export via Item Loaders with input and output processors.
For the first two examples, MapCompose looks like a good fit.
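As a minimal sketch of that idea, reusing two field names from the question's Product item (the specific cleaning functions are my assumptions, not the asker's exact requirements):

import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst

class Product(scrapy.Item):
    # MapCompose runs each function over every extracted string (Examples 1 and 2);
    # TakeFirst keeps a single cleaned value instead of a one-element list.
    type = scrapy.Field(
        # Assumption: keep only the text before the first comma
        input_processor=MapCompose(str.strip, lambda v: v.split(',')[0]),
        output_processor=TakeFirst(),
    )
    duration_weekly = scrapy.Field(
        # Assumption: normalize non-breaking spaces left over from the HTML
        input_processor=MapCompose(str.strip, lambda v: v.replace('\xa0', ' ')),
        output_processor=TakeFirst(),
    )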

FOR loop should yield multiple results, but only yields one

I'm trying to pull very specific elements from a dictionary of RSS data that was fetched using the feedparser library, then place that data into a new dictionary so it can be called on later using Flask. The reason I'm doing this is because the original dictionary contains tons of metadata I don't need.
I have broken down the process into simple steps but keep getting hung up on creating the new dictionary! As it is below, it does create a dictionary object, but it's not comprehensive-- it only contains a single article's title, URL and description-- the rest is absent.
I've tried switching to other RSS feeds and had the same result, so it would appear the problem is either the way I'm trying to do it, or there's something wrong with the structure of the list generated by feedparser.
Here's my code:
from html.parser import HTMLParser
import feedparser

def get_feed():
    url = "http://thefreethoughtproject.com/feed/"
    front_page = feedparser.parse(url)
    return front_page

feed = get_feed()

# make a dictionary to update with the vital information
posts = {}
for i in range(0, len(feed['entries'])):
    posts.update({
        'title': feed['entries'][i].title,
        'description': feed['entries'][i].summary,
        'url': feed['entries'][i].link,
    })
print(posts)
Ultimately, I'd like to have a dictionary like the following, except that it keeps going with more articles:
[{'Title': 'Trump Does Another Ridiculous Thing',
  'Description': 'Witnesses looked on in awe as the Donald did this thing',
  'Link': 'SomeNewsWebsite.com/Story12345'},
 {...},
 {...}]
Something tells me it's a simple mistake-- perhaps the syntax is off, or I'm forgetting a small yet important detail.
The code example you provided does an update to the same dict over and over again, so you only get one dict at the end of the loop. What your example data shows is that you actually want a list of dictionaries:
# make a list to update with the vital information
posts = []
for entry in feed['entries']:
    posts.append({
        'title': entry.title,
        'description': entry.summary,
        'url': entry.link,
    })
print(posts)
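Equivalently, the same list can be built with a list comprehension:

posts = [
    {'title': entry.title, 'description': entry.summary, 'url': entry.link}
    for entry in feed['entries']
]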
It seems that the problem is that you are using a dict instead of a list: you keep updating the same keys of the dict, so each iteration overrides the content added by the last one.
I think the following code will solve your problem:
from html.parser import HTMLParser
import feedparser

def get_feed():
    url = "http://thefreethoughtproject.com/feed/"
    front_page = feedparser.parse(url)
    return front_page

feed = get_feed()

# make a list to update with the vital information
posts = []  # It should be a list
for i in range(0, len(feed['entries'])):
    posts.append({
        'title': feed['entries'][i].title,
        'description': feed['entries'][i].summary,
        'url': feed['entries'][i].link,
    })
print(posts)
As you can see, the code above defines the posts variable as a list. Then, in the loop, we add dicts to this list, which gives you the data structure you want.
I hope this solution helps you.

Scrapy, Crawling Reviews on Tripadvisor: extract more hotel and user information

I need to extract more information from TripAdvisor.
My code:
item = TripadvisorItem()
item['url'] = response.url.encode('ascii', errors='ignore')
item['state'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')
if item['state'] == []:
    item['state'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(@class,"region_title")][2]/text()').extract()
item['city'] = hxs.select('//*[@id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if item['city'] == []:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if item['city'] == []:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city'] = item['city'][0].encode('ascii', errors='ignore')
item['hotelName'] = hxs.xpath('//*[@id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName'] = item['hotelName'][0].encode('ascii', errors='ignore')
reviews = hxs.select('.//div[contains(@id, "review")]')
1. For every hotel on TripAdvisor there is an ID number, like 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS. How can I extract this ID from the TA item?
2. More information I need for every hotel: shortDescription, stars, zipCode, country and coordinates (long, lat). Can I extract these things?
3. I need to extract the traveller type for every review. How?
My code for the reviews:
for review in reviews:
    it = Review()
    it['state'] = item['state']
    it['city'] = item['city']
    it['hotelName'] = item['hotelName']
    it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/@title').extract()
    if it['date'] == []:
        it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
    if it['date'] != []:
        it['date'] = it['date'][0].encode('ascii', errors='ignore').replace("Reviewed", "").strip()
    it['userName'] = review.xpath('.//div[contains(@class,"username mo")]/span/text()').extract()
    if it['userName'] != []:
        it['userName'] = it['userName'][0].encode('ascii', errors='ignore')
    it['userLocation'] = ''.join(review.xpath('.//div[contains(@class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')
    it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(@class,"quote")]/text()').extract()
    if it['reviewTitle'] != []:
        it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    else:
        it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(@class,"noQuotes")]/text()').extract()
        if it['reviewTitle'] != []:
            it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
    if it['reviewContent'] != []:
        it['reviewContent'] = it['reviewContent'][0].encode('ascii', errors='ignore').strip()
    it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/@alt').extract()
    if it['generalRating'] != []:
        it['generalRating'] = it['generalRating'][0].encode('ascii', errors='ignore').split()[0]
Is there a good manual on how to find these things? I lost myself with all the spans and divs...
Thanks!
I'll try to do this purely in XPath. Unfortunately, it looks like most of the info you want is contained in <script> tags:
Hotel ID - Returns "80075"
substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")
Alternatively, the Hotel ID is in the URL, as another answerer mentioned. If you're sure the format will always be the same (such as including a "d" prior to the ID), then you can use that instead.
Rating (the one at the top) - Returns "3.5"
//span[contains(#class, "rating_rr")]/img/#content
There are a couple of instances of ratings on this page. The main rating at the top is what I've grabbed here. I haven't tested this within Scrapy, so it's possible that it's populated by JavaScript and not initially loaded as part of the HTML. If that's the case, you'll need to grab it somewhere else or use something like Selenium/PhantomJS.
Zip Code - Returns "10019"
(//span[#property="v:postal-code"]/text())[1]
Again, same deal as above. It's in the HTML, but you should check whether it's there upon page load.
Country - Returns ""US""
substring-before(substring-after(//script[contains(., "modelLocaleCountry")]/text(), "modelLocaleCountry = "), ";")
This one comes with quotes. You can always (and you should) use a pipeline to sanitize scraped data to get it to look the way you want.
Coordinates - Returns "40.76174" and "-73.985275", respectively
Lat: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")
Lon: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lng:")), ",")
I'm not entirely sure where the short description exists on this page, so I didn't include that. It's possible you have to navigate elsewhere to get it. I also wasn't 100% sure what the "traveler type" meant, so I'll leave that one up to you.
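Wired into a Scrapy callback, the expressions above could be used roughly like this (an untested sketch; the dict keys are mine):

def parse(self, response):
    # Each XPath below is one of the one-liners above; they evaluate to
    # strings, so extract_first() returns the value directly.
    script_text = '//script[contains(., "geoId:") and contains(., "lat")]/text()'
    yield {
        'hotel_id': response.xpath(
            'substring-before(normalize-space(substring-after(%s, "locId:")), ",")' % script_text).extract_first(),
        'rating': response.xpath('//span[contains(@class, "rating_rr")]/img/@content').extract_first(),
        'zip_code': response.xpath('(//span[@property="v:postal-code"]/text())[1]').extract_first(),
        'lat': response.xpath(
            'substring-before(normalize-space(substring-after(%s, "lat:")), ",")' % script_text).extract_first(),
        'lon': response.xpath(
            'substring-before(normalize-space(substring-after(%s, "lng:")), ",")' % script_text).extract_first(),
    }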
As far as a manual, it's really about practice. You learn tricks and hacks for working within XPath, and Scrapy allows you to use some added features, such as regex and pipelines. I wouldn't recommend doing the whole "absolute path" XPath (i.e., ./div/div[3]/div[2]/ul/li[3]/...), since any deviation from that within the DOM will completely ruin your scraping. If you have a lot of data to scrape, and you plan on keeping this around a while, your project will become unmanageable very quickly if any site moves around even a single <div>.
I'd recommend more "querying" XPaths, such as //div[contains(#class, "foo")]//a[contains(#href, "detailID")]. Paths like that will make sure that no matter how many elements are placed between the elements you know will be there, and even if multiple target elements are slightly different from each other, you'll be able to grab them consistently.
XPaths are a lot of trial and error. A LOT. Here are a few tools that help me out significantly:
XPath Helper (Chrome extension)
scrapy shell <URL>
scrapy view <URL> (for rendering Scrapy's response in a browser)
PhantomJS (if you're interested in getting data that's been inserted via JavaScript)
Hope some of this helped.
Is it acceptable to get it from the URL using a regex?
id = re.search('(-d)([0-9]+)', url).group(2)
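For completeness, a quick standalone sketch of that regex (variable names are mine, and I capture the digits with a single group instead of two):

import re

url = ('http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-'
       'Amsterdam_Court_Hotel-New_York_City_New_York.html')
match = re.search(r'-d([0-9]+)', url)
hotel_id = match.group(1) if match else None
print(hotel_id)  # 80075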
