I have to crawl the following URL, which contains reviews. Each review there has a writer name, a title, and the review text itself.
I've chosen "python-scrapy" to do this task.
But the URL mentioned is not the start URL; I obtain it from the basic parse method. In parse I initialize an ItemLoader, extract a few items there, and pass them via the meta of the response. (The extracted fields contain information such as hotel name, address, pricing, etc.)
I have also declared items, namely "review_member_name", "review_quote", and "review_review".
There is more than one review on the page, and the review id for a review can be obtained from response.url (shown in parse_review below).
Since there are many reviews and all share the same item fields, the values get concatenated, which should not happen. Can anybody suggest a way to solve this?
Below is my source for parse_review:
def parse_review(self, response):
    review_nos = re.search(r".*www\.tripadvisor\.in/ExpandedUserReviews-.*context=1&reviews=(.+)&servlet=Hotel_Review&expand=1", response.url).group(1)
    review_nos = review_nos.split(',')  # list of review ids
    for review_no in review_nos:
        item = response.meta['item']
        # item = ItemLoader(item=TripadvisorItem(), response=response) - this works fine but I will lose the items from the parse method
        div_id = "expanded_review_" + review_no
        review = response.xpath('/html/body/div[@id="%s"]' % div_id)
        member_name = review.xpath('.//div[@class="member_info"]//div[@class="username mo"]//text()').extract()
        if member_name:
            item.add_value('review_member_name', member_name)
        review_quote = review.xpath('.//div[@class="innerBubble"]/div[@class="quote"]//text()').extract()
        if review_quote:
            item.add_value('review_quote', review_quote)
        review_entry = review.xpath('.//div[@class="innerBubble"]/div[@class="entry"]//text()').extract()
        if review_entry:
            item.add_value('review_review', review_entry)
        yield item.load_item()
Following is my items.json ("review_review" is missing, and the items from the parse method are gone too):
[{"review_quote": "\u201c Fabulous service \u201d", "review_member_name": "VimalPrakash"},
{"review_quote": "\u201c Fabulous service \u201d \u201c Indian hospitality at its best, and honestly the best coffee in India \u201d", "review_member_name": "VimalPrakash Jessica P"},
{"review_quote": "\u201c Fabulous service \u201d \u201c Indian hospitality at its best, and honestly the best coffee in India \u201d \u201c Nice hotel in a central location \u201d", "review_member_name": "VimalPrakash Jessica P VikInd"}]
And please suggest a good title for this question.
You'll have to create a new ItemLoader before calling add_value on it; right now you're creating one item and adding new values to it over and over in the loop.
for review_no in review_nos:
    item = ItemLoader(item=projectItem(), response=response)
    ...
    yield item.load_item()
You can also use .add_xpath directly with the XPath you're supplying, and use the review selector as the selector for the item when creating the ItemLoader; that way you can probably avoid all the if tests and let the loader do what it should do: load items.
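For instance, a minimal sketch combining both suggestions (this assumes parse() passes the already-loaded hotel-level item, not the loader itself, through response.meta; TripadvisorItem and the field names come from the question):

def parse_review(self, response):
    review_nos = re.search(r".*reviews=(.+)&servlet=Hotel_Review.*", response.url).group(1).split(',')
    parent = response.meta['item']  # hotel-level item loaded in parse()
    for review_no in review_nos:
        review = response.xpath('//div[@id="expanded_review_%s"]' % review_no)
        loader = ItemLoader(item=TripadvisorItem(), selector=review)
        for field, value in parent.items():
            loader.add_value(field, value)  # carry the hotel-level fields into each review item
        loader.add_xpath('review_member_name', './/div[@class="member_info"]//div[@class="username mo"]//text()')
        loader.add_xpath('review_quote', './/div[@class="innerBubble"]/div[@class="quote"]//text()')
        loader.add_xpath('review_review', './/div[@class="innerBubble"]/div[@class="entry"]//text()')
        yield loader.load_item()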
I am creating a news feed scraper so I can collate my favourite football team's news daily. I'm an apprentice developer and I thought doing it would increase my knowledge. It's just a simple thing that scans one or two sites for headlines and returns the text of the headlines. I have downloaded Python, gained a bit of knowledge of BeautifulSoup methods, managed to find a path directly to each headline on my chosen site, and stored these in an array:
page_soup = soup(page_html, "html.parser")  # "parses" the stored data (page_html)
allHeadlines = page_soup.findAll(class_="lakeside__title-text")  # finds all titles on the BBC Liverpool Sports page
headline1 = allHeadlines[0]  # create a single entry called "headline1" from the first slot in our search results
headline1.text  # prints the "headline1" string to show it's working, e.g. 'What do you know about Dalglish?' (my result)
But now I am puzzled as to how to create the loop needed to store the data and display it.
for item in allHeadlines:
    # something here. I'm a noob, so all I know around this is usually item = item + 1

Then print to a file, etc.
Any reading material on this topic would be greatly appreciated.
Sorry for editing issues, my first ever post.
Assuming allHeadlines is a list of objects (which have a text attribute), we can create a list of their text with a for loop, for display or for writing to a file:
text_headlines = [ item.text for item in allHeadlines if item.text ]
print(text_headlines)
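Since you also asked about printing to a file, here is a minimal sketch of that (the filename headlines.txt is just a placeholder):

text_headlines = [item.text for item in allHeadlines if item.text]

# write one headline per line; "headlines.txt" is a placeholder name
with open("headlines.txt", "w", encoding="utf-8") as f:
    for headline in text_headlines:
        f.write(headline + "\n")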
I've been working on something that I think should be relatively easy, but I keep hitting my head against a wall. I've tried multiple similar solutions from Stack Overflow and improved my code, but I'm still stuck on the basic functionality.
I am scraping a web page that returns an element (genre) that is essentially a list of genres:
Mystery, Comedy, Horror, Drama
The XPath returns perfectly. I'm using a Scrapy pipeline to output to a CSV file. What I'd like to do is create a separate row for each item in the above list, along with the page URL:
"Mystery", "http:domain.com/page1.html"
"Comedy", "http:domain.com/page1.html"
No matter what I try I can only output:
"Mystery, Comedy, Horror, Drama", ""http:domain.com/page1.html"
Here's my code:
def parse_genre(self, response):
    for item in [i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]:
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', item, MapCompose(str.strip))
        yield sg.load_item()
This is called from the main parse routine for the spider, and that all functions correctly. (I have two items on each web page. The main spider gathers the "parent" information, and this function is attempting to gather the "child" information. Technically not a child record, but definitely a one-to-many relationship.)
I've tried a number of possible solutions. This is the only version that makes sense to me and seems like it should work. I'm sure I'm just not splitting the genre string correctly.
You are very close.
Your culprit seems to be the way you are getting your items:
[i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]
Without having the source I can't correct you fully, but it is obvious that your code is returning a list of lists.
You should either flatten this list of lists into a list of strings or iterate through it appropriately:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
for item in items:
    for category in item.split(','):
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', category, MapCompose(str.strip))
        yield sg.load_item()
An alternative, more advanced technique would be to use a nested list comprehension:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
# good cheatsheet to remember this: [leaf for tree in forest for leaf in tree]
categories = [cat for item in items for cat in item.split(',')]
for category in categories:
    sg = ItemLoader(item=ItemGenre(), response=response)
    sg.add_value('url', response.url)
    sg.add_value('genre', category, MapCompose(str.strip))
    yield sg.load_item()
Set-up
I'm scraping housing ads with scrapy: per housing ad I scrape several housing characteristics.
Scraping the housing characteristics works fine.
Problem
Besides the housing characteristics, I want to scrape one image per ad.
I have the following code:
class ApartmentSpider(scrapy.Spider):
    name = 'apartments'
    start_urls = [
        'http://www.jaap.nl/huurhuizen/noord+holland/groot-amsterdam/amsterdam'
    ]

    def parse(self, response):
        for href in response.xpath(
            '//*[@id]/a',
        ).css("a.property-inner::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_ad)       # parse_ad() scrapes housing characteristics
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_AdImage)  # parse_AdImage() obtains one image per ad
So I've got two yield statements, which does not work: I get the characteristics, but not the images.
If I comment out the first one, I get the images instead.
How do I fix this such that I get both? Thanks in advance.
Just yield them both together.
yield (scrapy.Request(response.urljoin(href), callback=self.parse_ad), scrapy.Request(response.urljoin(href), callback=self.parse_AdImage))
On the receiving end, grab both as separate values
characteristics, image = ApartmentSpider.parse(response)
I have two major suggestions:
Number 1
I would strongly suggest reworking your code to actually farm out all the info at the same time. Instead of having two separate parse_X functions, just have one that gets the info and returns a single item, as in the sketch below.
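A minimal sketch of that idea (the selectors and field names here are hypothetical placeholders, not the actual page structure):

def parse_ad(self, response):
    # single callback that collects the characteristics and the image together;
    # both CSS selectors below are placeholders you'd replace with the real ones
    yield {
        'url': response.url,
        'characteristics': response.css('.property-characteristics ::text').extract(),
        'image_url': response.css('img.property-photo::attr(src)').extract_first(),
    }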
Number 2
Implement a spider middleware that does merging/splitting similar to what I have below for pipelines. A simple example middleware is https://github.com/scrapy/scrapy/blob/ebef6d7c6dd8922210db8a4a44f48fe27ee0cd16/scrapy/spidermiddlewares/urllength.py. You would simply merge items and track them here before they enter the item pipelines, along the lines of the sketch below.
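A rough sketch of what such a middleware could look like (the partial item classes are the ones from the pipeline pseudocode further down, and the 'ad_id' key is a hypothetical stand-in for whatever uniquely identifies an ad):

class MergePartialItemsMiddleware(object):
    """Spider middleware that pairs up partial items before the pipelines see them."""

    def __init__(self):
        self.partials = {}  # ad_id -> first partial item seen

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, (PartialAdHousingItem, PartialAdImageHousingItem)):
                ad_id = element['ad_id']  # hypothetical unique key
                other = self.partials.pop(ad_id, None)
                if other is None:
                    self.partials[ad_id] = element  # hold until the counterpart arrives
                else:
                    merged = dict(other)
                    merged.update(element)
                    yield merged  # both halves seen: emit the merged item
            else:
                yield element  # requests and other output pass through untouched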
WARNING: DO NOT DO WHAT'S IN THE PIPELINE SKETCH BELOW. I WAS GOING TO SUGGEST THIS, AND THE CODE MIGHT WORK... BUT WITH SOME POTENTIALLY HIDDEN ISSUES.
IT IS HERE FOR COMPLETENESS OF WHAT I WAS RESEARCHING. IT IS RECOMMENDED AGAINST HERE: https://github.com/scrapy/scrapy/issues/1915
Use the item processing pipelines in Scrapy. They are incredibly useful for accumulating data. Have an item-joiner pipeline whose purpose is to wait for the two separate partial data items and concatenate them into one item, keyed on the ad id (or some other unique piece of data).
In rough, not-runnable pseudocode:
class HousingItemPipeline(object):
    def __init__(self):
        self.assembledItems = dict()

    def process_item(self, item, spider):
        if isinstance(item, PartialAdHousingItem):
            self.assembledItems[unique_id] = AssembledHousingItem()
            self.assembledItems[unique_id]['field_of_interest'] = ...
            # ...assemble more data
            raise DropItem("Assembled its data")
        if isinstance(item, PartialAdImageHousingItem):
            self.assembledItems[unique_id]['field_of_interest'] = ...
            # ...assemble more data
            raise DropItem("Assembled its data")
        if fully_assembled:
            return self.assembledItems.pop(unique_id)
I need to extract more information from TripAdvisor.
My code:
item = TripadvisorItem()
item['url'] = response.url.encode('ascii', errors='ignore')
item['state'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')
if item['state'] == []:
    item['state'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(@class,"region_title")][2]/text()').extract()
item['city'] = hxs.select('//*[@id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if item['city'] == []:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if item['city'] == []:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city'] = item['city'][0].encode('ascii', errors='ignore')
item['hotelName'] = hxs.xpath('//*[@id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName'] = item['hotelName'][0].encode('ascii', errors='ignore')
reviews = hxs.select('.//div[contains(@id, "review")]')
1. For every hotel on TripAdvisor there is an id number, like 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS. How can I extract this id from the TA item?
2. More information I need for every hotel: shortDescription, stars, zipCode, country and coordinates (long, lat). Can I extract these things?
3. I need to extract the traveller type for every review. How?
My code for the reviews:
for review in reviews:
    it = Review()
    it['state'] = item['state']
    it['city'] = item['city']
    it['hotelName'] = item['hotelName']
    it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/@title').extract()
    if it['date'] == []:
        it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
    if it['date'] != []:
        it['date'] = it['date'][0].encode('ascii', errors='ignore').replace("Reviewed", "").strip()
    it['userName'] = review.xpath('.//div[contains(@class,"username mo")]/span/text()').extract()
    if it['userName'] != []:
        it['userName'] = it['userName'][0].encode('ascii', errors='ignore')
    it['userLocation'] = ''.join(review.xpath('.//div[contains(@class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')
    it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(@class,"quote")]/text()').extract()
    if it['reviewTitle'] != []:
        it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    else:
        it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(@class,"noQuotes")]/text()').extract()
        if it['reviewTitle'] != []:
            it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
    if it['reviewContent'] != []:
        it['reviewContent'] = it['reviewContent'][0].encode('ascii', errors='ignore').strip()
    it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/@alt').extract()
    if it['generalRating'] != []:
        it['generalRating'] = it['generalRating'][0].encode('ascii', errors='ignore').split()[0]
Is there a good manual for how to find these things? I lost myself in all the spans and the divs.
Thanks!
I'll try to do this purely in XPath. Unfortunately, it looks like most of the info you want is contained in <script> tags:
Hotel ID - Returns "80075"
substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")
Alternatively, the Hotel ID is in the URL, as another answerer mentioned. If you're sure the format will always be the same (such as including a "d" prior to the ID), then you can use that instead.
Rating (the one at the top) - Returns "3.5"
//span[contains(@class, "rating_rr")]/img/@content
There are a couple of instances of ratings on this page. The main rating at the top is what I've grabbed here. I haven't tested this within Scrapy, so it's possible that it's populated by JavaScript and not initially loaded as part of the HTML. If that's the case, you'll need to grab it somewhere else or use something like Selenium/PhantomJS.
Zip Code - Returns "10019"
(//span[@property="v:postal-code"]/text())[1]
Again, same deal as above. It's in the HTML, but you should check whether it's there upon page load.
Country - Returns ""US""
substring-before(substring-after(//script[contains(., "modelLocaleCountry")]/text(), "modelLocaleCountry = "), ";")
This one comes with quotes. You can always (and you should) use a pipeline to sanitize scraped data to get it to look the way you want.
Coordinates - Returns "40.76174" and "-73.985275", respectively
Lat: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")
Lon: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lng:")), ",")
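For what it's worth, these expressions can be evaluated directly from a Scrapy callback, since XPath string functions like substring-after() return a single value (the callback and field names here are placeholders; as noted above, verify the values are actually in the initial HTML):

def parse_hotel(self, response):
    lat = response.xpath(
        'substring-before(normalize-space(substring-after('
        '//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")'
    ).extract()
    zip_code = response.xpath('(//span[@property="v:postal-code"]/text())[1]').extract()
    yield {'lat': lat, 'zipCode': zip_code}  # placeholder field names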
I'm not entirely sure where the short description exists on this page, so I didn't include that. It's possible you have to navigate elsewhere to get it. I also wasn't 100% sure what the "traveler type" meant, so I'll leave that one up to you.
As far as a manual, it's really about practice. You learn tricks and hacks for working within XPath, and Scrapy allows you to use some added features, such as regex and pipelines. I wouldn't recommend doing the whole "absolute path" XPath (i.e., ./div/div[3]/div[2]/ul/li[3]/...), since any deviation from that within the DOM will completely ruin your scraping. If you have a lot of data to scrape, and you plan on keeping this around a while, your project will become unmanageable very quickly if any site moves around even a single <div>.
I'd recommend more "querying" XPaths, such as //div[contains(@class, "foo")]//a[contains(@href, "detailID")]. Queries like that keep working no matter how many elements sit between the ones you know will be there, and even if multiple target elements differ slightly from each other, you'll be able to grab them consistently.
XPaths are a lot of trial and error. A LOT. Here are a few tools that help me out significantly:
XPath Helper (Chrome extension)
scrapy shell <URL>
scrapy view <URL> (for rendering Scrapy's response in a browser)
PhantomJS (if you're interested in getting data that's been inserted via JavaScript)
Hope some of this helped.
Is it acceptable to get it from the URL using a regex?
id = re.search('(-d)([0-9]+)', url).group(2)
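For example, applied to the hotel URL from the question:

import re

url = "http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html"
hotel_id = re.search('(-d)([0-9]+)', url).group(2)
print(hotel_id)  # -> 80075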
I am trying to crawl PubMed with Python and get the PubMed IDs of all papers that cite a given article.
For example this article (ID: 11825149)
http://www.ncbi.nlm.nih.gov/pubmed/11825149
Has a page linking to all articles that cite it:
http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=11825149
The problem is it has over 200 links but only shows 20 per page, and the 'next page' link is not accessible by URL.
Is there a way to open the 'send to' option or view the content on the next pages with python?
How I currently open pubmed pages:
def start(seed):
    webpage = urlopen(seed).read()
    print webpage

    citedByPage = urlopen('http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=' + pageid).read()
    print citedByPage
From this I can extract all the cited by links on the first page, but how can I extract them from all pages? Thanks.
I was able to get the cited-by IDs using a method from this page:
http://www.bio-cloud.info/Biopython/en/ch8.html
Back in Section 8.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let’s try this for the Biopython PDB parser paper, PubMed ID 14630660:
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
... LinkName="pubmed_pmc_refs", from_uid=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']
Great - eleven articles. But why hasn't the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, these are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.
So, what if (like me) you’d rather get back a list of PubMed IDs? Well we can call ELink again to translate them. This becomes a two step process, so by now you should expect to use the history feature to accomplish it (Section 8.15).
But first, taking the more straightforward approach of making a second (separate) call to ELink:
>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
... from_uid=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']
This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).
And finally, don’t forget to include your own email address in the Entrez calls.
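Putting the two ELink calls together, a small helper along these lines should work (get_citing_pmids is my own name for it; the Entrez calls are exactly the tutorial's):

from Bio import Entrez

Entrez.email = "your.name@example.com"  # identify yourself to NCBI

def get_citing_pmids(pmid):
    """Return PubMed IDs of PMC-indexed papers that cite the given PMID."""
    # step 1: PubMed ID -> PMC IDs of citing articles
    results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
                                       LinkName="pubmed_pmc_refs", from_uid=pmid))
    pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
    # step 2: translate those PMC IDs back to PubMed IDs
    results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed",
                                        LinkName="pmc_pubmed",
                                        from_uid=",".join(pmc_ids)))
    return [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]

print(get_citing_pmids("14630660"))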