Set-up
I'm scraping housing ads with Scrapy: for each housing ad I scrape several housing characteristics.
Scraping the housing characteristics works fine.
Problem
Besides the housing characteristics, I want to scrape one image per ad.
I have the following code:
import scrapy

class ApartmentSpider(scrapy.Spider):
    name = 'apartments'
    start_urls = [
        'http://www.jaap.nl/huurhuizen/noord+holland/groot-amsterdam/amsterdam'
    ]

    def parse(self, response):
        for href in response.xpath(
            '//*[@id]/a',
        ).css("a.property-inner::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_ad)       # parse_ad() scrapes housing characteristics
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_AdImage)  # parse_AdImage() obtains one image per ad
So I've got two yield statements, which does not work: I get the characteristics, but not the images.
If I comment out the first one, I get the images instead.
How do I fix this so that I get both? Thanks in advance.
Just yield them both together.
yield (scrapy.Request(response.urljoin(href), callback=self.parse_ad),
       scrapy.Request(response.urljoin(href), callback=self.parse_AdImage))
On the receiving end, grab both as separate values
characteristics, image = ApartmentSpider.parse(response)
I have two major suggestions:
Number 1
I would strongly suggest re-working your code to actually farm out all the info at the same time. Instead of having two separate parse_X functions...just have one that gets the info and returns a single item.
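For instance, a rough sketch of that idea (the selectors and field names here are illustrative placeholders, not taken from the actual site):

def parse_ad(self, response):
    # One callback gathers the characteristics and the image together,
    # so a single item per ad is yielded. Selectors are placeholders.
    yield {
        'url': response.url,
        'price': response.css('.price::text').extract_first(),
        'rooms': response.css('.rooms::text').extract_first(),
        'image_url': response.css('img.photo::attr(src)').extract_first(),
    }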
Number 2
Implement a Spider Middleware that does merging/splitting similar to what I have below for pipelines. A simple example middleware is https://github.com/scrapy/scrapy/blob/ebef6d7c6dd8922210db8a4a44f48fe27ee0cd16/scrapy/spidermiddlewares/urllength.py. You would simply merge items and track them here before they enter the item pipelines.
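A rough sketch of that middleware idea (this is separate from the pipeline code warned against below; the Partial* item classes and the 'ad_id', 'price' and 'image_url' fields are hypothetical stand-ins, and the middleware would need to be enabled via the SPIDER_MIDDLEWARES setting):

class MergePartialItemsMiddleware:

    def __init__(self):
        self.partials = {}  # ad_id -> fields merged so far

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, (PartialAdHousingItem, PartialAdImageHousingItem)):
                merged = self.partials.setdefault(obj['ad_id'], {})
                merged.update(dict(obj))
                # Only emit once both halves (characteristics + image) have arrived.
                if 'price' in merged and 'image_url' in merged:
                    yield self.partials.pop(obj['ad_id'])
            else:
                # Requests and anything else pass through untouched.
                yield obj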
WARNING: DO NOT DO WHAT'S BELOW. I WAS GOING TO SUGGEST THIS, AND THE CODE MIGHT WORK... BUT WITH SOME POTENTIALLY HIDDEN ISSUES.
IT IS HERE FOR COMPLETENESS OF WHAT I WAS RESEARCHING -- IT IS RECOMMENDED AGAINST HERE: https://github.com/scrapy/scrapy/issues/1915
Use the item processing pipelines in Scrapy. They are incredibly useful for accumulating data. Have an item joiner pipeline whose purpose is to wait for the two separate partial data items, concatenate them into one item, and key them on the ad id (or some other unique piece of data).
In rough, not-runnable pseudocode:
class HousingItemPipeline(object):

    def __init__(self):
        self.assembledItems = dict()

    def process_item(self, item, spider):
        if isinstance(item, PartialAdHousingItem):
            self.assembledItems[unique_id] = AssembledHousingItem()
            self.assembledItems[unique_id]['field_of_interest'] = ...
            # ...assemble more data
            raise DropItem("Assembled its data")

        if isinstance(item, PartialAdImageHousingItem):
            self.assembledItems[unique_id]['field_of_interest'] = ...
            # ...assemble more data
            raise DropItem("Assembled its data")

        if fully_assembled:
            return self.assembledItems.pop(unique_id)
Related
I'm working on a divvy dataset project.
I want to scrape the information for each suggested location, along with the comments provided, from here: http://suggest.divvybikes.com/.
Am I able to scrape this information from Mapbox? It is displayed on a map so it must have the information somewhere.
I visited the page, and logged my network traffic using Google Chrome's Developer Tools. Filtering the requests to view only XHR (XmlHttpRequest) requests, I saw a lot of HTTP GET requests to various REST APIs. These REST APIs return JSON, which is ideal. Only two of these APIs seem to be relevant for your purposes - one is for places, the other for comments associated with those places. The places API's JSON contains interesting information, such as place ids and coordinates. The comments API's JSON contains all comments regarding a specific place, identified by its id. Mimicking those calls is pretty straightforward with the third-party requests module. Fortunately, the APIs don't seem to care about request headers. The query-string parameters (the params dictionary) need to be well-formulated though, of course.
I was able to come up with the following two functions: get_places makes multiple calls to the same API, each time with a different page query-string parameter. It seems that "page" is the term they use internally to split up all their data into different chunks - all the different places/features/stations are split up across multiple pages, and you can only get one page per API call. The while-loop accumulates all places in a giant list, and it keeps going until we receive a response which tells us there are no more pages. Once the loop ends, we return the list of places.
The other function is get_comments, which takes a place id (string) as a parameter. It then makes an HTTP GET request to the appropriate API, and returns a list of comments for that place. This list may be empty if there are no comments.
def get_places():
    import requests
    from itertools import count

    api_url = "http://suggest.divvybikes.com/api/places"
    page_counter = count(1)
    places = []

    for page_nr in page_counter:
        params = {
            "page": str(page_nr),
            "include_submissions": "true"
        }

        response = requests.get(api_url, params=params)
        response.raise_for_status()
        content = response.json()

        places.extend(content["features"])

        if content["metadata"]["next"] is None:
            break

    return places

def get_comments(place_id):
    import requests

    api_url = "http://suggest.divvybikes.com/api/places/{}/comments".format(place_id)

    response = requests.get(api_url)
    response.raise_for_status()

    return response.json()["results"]

def main():
    from operator import itemgetter

    places = get_places()

    place_id = places[12]["id"]
    print("Printing comments for the thirteenth place (id: {})\n".format(place_id))
    for comment in map(itemgetter("comment"), get_comments(place_id)):
        print(comment)

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Printing comments for the thirteenth place (id: 107062)
I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.
>>>
For this example, I'm printing all the comments for the 13th place in our list of places. I picked that one because it is the first place which actually has comments (places 0-11 didn't have any; most places don't seem to have comments). In this case, this place had only one comment.
EDIT - If you want to save the place ids, longitude, latitude and comments in a CSV, you can try changing the main function to:
def main():
    import csv

    print("Getting places...")
    places = get_places()
    print("Got all places.")

    # The API's GeoJSON coordinates are [longitude, latitude], so the
    # fieldnames follow that order.
    fieldnames = ["place id", "longitude", "latitude", "comments"]

    print("Writing to CSV file...")
    with open("output.csv", "w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames)
        writer.writeheader()
        num_places_to_write = 25
        for place_nr, place in enumerate(places[:num_places_to_write], start=1):
            print("Writing place #{}/{} with id {}".format(place_nr, num_places_to_write, place["id"]))
            longitude, latitude = place["geometry"]["coordinates"]
            comments = [c["comment"] for c in get_comments(place["id"])]
            writer.writerow(dict(zip(fieldnames, [place["id"], longitude, latitude, comments])))

    return 0
With this, I got results like:
place id,longitude,latitude,comments
107098,-87.6711076553,41.9718155716,[]
107097,-87.759540081,42.0121073671,[]
107096,-87.747695446,42.0263916146,[]
107090,-87.6642036438,42.0162096564,[]
107089,-87.6609444613,41.8852953922,[]
107083,-87.6007853815,41.8199433342,[]
107082,-87.6355862613,41.8532736671,[]
107075,-87.6210737228,41.8862644836,[]
107074,-87.6210737228,41.8862644836,[]
107073,-87.6210737228,41.8862644836,[]
107065,-87.6499611139,41.9627251578,[]
107064,-87.6136027649,41.8332984674,[]
107062,-87.7073025402,42.0760990584,"[""I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.""]"
In this case, I used the list-slicing syntax (places[:num_places_to_write]) to only write the first 25 places to the CSV file, just for demonstration purposes. However, after about the first thirteen were written, I got this exception message:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
So, I'm guessing that the comments API doesn't expect to receive so many requests in such a short amount of time. You may have to sleep in the loop for a bit to get around this, as in the sketch below. It's also possible that the API doesn't care and just happened to time out.
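For example, a minimal sketch of that workaround, reusing the get_comments function from above (the one-second delay is an arbitrary guess):

import time

def get_comments_politely(place_id, delay=1.0):
    # Wait a bit before each comments request so the API isn't hit
    # too quickly; tune the delay as needed.
    time.sleep(delay)
    return get_comments(place_id)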
In Scrapinghub, how can we export multiple item types?
I have MainItem() and SubItem() item classes, and I would like to see both as separate items on the Scrapinghub items page.
I can do this by implementing different item pipelines for the two items in a normal crawl, but how can this be achieved in Scrapinghub? As of now, I'm getting only MainItem() objects on the items page.
A sample code snippet is given below:
def parse_data(self, response):
    # ...

    # main item fields
    m_item = MainItem()
    m_item['a'] = 'A'
    m_item['b'] = 'B'
    yield m_item

    # sub item fields
    s_item = SubItem()
    s_item['c'] = 'C'
    s_item['d'] = 'D'
    yield s_item
Here, in Scrapinghub, I'm able to view only MainItem() fields.
Can you provide more information (the spider code and logs)? I can't see any problem with your example.
Scrapy Cloud does allow a spider to yield different item types. These items can be filtered later using the Scrapy Cloud interface.
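For reference, a minimal sketch of the two item classes as plain scrapy.Item subclasses (field names taken from the snippet above); nothing Scrapy Cloud-specific is needed for both item types to appear:

import scrapy

class MainItem(scrapy.Item):
    a = scrapy.Field()
    b = scrapy.Field()

class SubItem(scrapy.Item):
    c = scrapy.Field()
    d = scrapy.Field()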
I've been working on something that I think should be relatively easy, but I keep hitting my head against a wall. I've tried multiple similar solutions from Stack Overflow and I've improved my code, but I'm still stuck on the basic functionality.
I am scraping a web page that returns an element (genre) that is essentially a list of genres:
Mystery, Comedy, Horror, Drama
The XPath extracts it perfectly. I'm using a Scrapy pipeline to output to a CSV file. What I'd like to do is create a separate row for each item in the above list, along with the page URL:
"Mystery", "http:domain.com/page1.html"
"Comedy", "http:domain.com/page1.html"
No matter what I try I can only output:
"Mystery, Comedy, Horror, Drama", ""http:domain.com/page1.html"
Here's my code:
def parse_genre(self, response):
    for item in [i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]:
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', item, MapCompose(str.strip))
        yield sg.load_item()
This is called from the main parse routine for the spider. That all functions correctly. (I have two items on each web page. The main spider gathers the "parent" information and this function is attempting to gather "child" information. Technically not a child record, but definitely a 1 to many relationship.)
I've tried a number of possible solutions. This is the only version that makes sense to me and seems like it should work. I'm sure I'm just not splitting the genre string correctly.
You are very close.
Your culprit seems to be the way you are getting your items:
[i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]
Without having the page source I can't correct you fully, but it is clear that your code is returning a list of lists.
You should either flatten this list of lists into a list of strings or iterate through it appropriately:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
for item in items:
    for category in item.split(','):
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', category, MapCompose(str.strip))
        yield sg.load_item()
An alternative, more advanced technique would be to use a nested list comprehension:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
# good cheatsheet to remember this: [leaf for tree in forest for leaf in tree]
categories = [cat for item in items for cat in item.split(',')]
for category in categories:
    sg = ItemLoader(item=ItemGenre(), response=response)
    sg.add_value('url', response.url)
    sg.add_value('genre', category, MapCompose(str.strip))
    yield sg.load_item()
I am totally new to programming, but I came across this bizarre phenomenon that I could not explain. Please point me in the right direction. I started crawling an entirely JavaScript-built webpage with AJAX tables. First I started with Selenium and it worked well. However, I noticed that some of you pros here mentioned Scrapy is much faster. Then I tried and succeeded in building the crawler under Scrapy, with a hell of a headache.
I need to use re to extract the JavaScript strings, and what happened next confused me. Here is what the Python doc says (https://docs.python.org/3/library/re.html):
"but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."
I first started with using re.search(pattern, str) in a loop. Scrapy crawled 202 pages in 3 seconds (finish time - start time).
Then I followed the Python doc's suggestion, compiling the pattern with re.compile(pattern) before the loop to improve efficiency. Scrapy crawled the same 202 pages in 37 seconds. What is going on here?
Here is some of the code; other suggestions to improve it are greatly appreciated. Thanks.
EDIT2: I was too presumptuous to base my view on a single run.
Three later tests with 2000 webpages show that compiling the regex within the loop finishes in 25s on average, while compiling before the loop finishes in 24s on average for the same 2000 webpages.
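A minimal timeit sketch along these lines (the pattern and sample text here are made up) can be used to compare the two approaches outside Scrapy; note that the re module caches recently compiled patterns, so calling re.search with the same pattern string normally pays the compile cost only once:

import re
import timeit

pattern = r"id=(\d+)"
text = "var row = 'id=12345';" * 20

compiled = re.compile(pattern)

# Both calls do essentially the same work, because re caches the
# compiled pattern internally after the first use.
print(timeit.timeit(lambda: re.search(pattern, text), number=100000))
print(timeit.timeit(lambda: compiled.search(text), number=100000))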
EDIT:
Here is the webpage I am trying to crawl
http://trac.syr.edu/phptools/immigration/detainhistory/
I am trying to crawl basically everything on a year-month basis from this database. There are three JavaScript-generated columns on this webpage. When you select options from these drop-down menus, the webpage sends SQL queries to its server and generates the corresponding contents. I figured out how to directly generate these pre-defined queries to crawl all the table contents; however, it is a great pain.
def parseYM(self, response):
    c_list = response.xpath()
    c_list_num = len(c_list)
    item2 = response.meta['item']
    # compiling before the loop
    # search_c = re.compile(pattern1)
    # search_cid = re.compile(pattern2)
    for j in range(c_list_num):
        item = myItem()
        item[1] = item2[1]
        item['id'] = item2['id']
        ym_id = item['ymid']
        item[3] = re.search(pattern1, c_list[j]).group(1)
        tmp1 = re.search(pattern2, c_list[j]).group(1)
        # item[3] = search_c.search(c_list[j]).group(1)
        # tmp1 = search_cid.search(c_list[j]).group(1)
        item[4] = tmp1
        link1 = 'link1'
        request = Request(link1, self.parse2, meta={'item': item}, dont_filter=True)
        yield request
An unnecessary temp variable is used to avoid long lines. Maybe there are better ways? I have a feeling that the regular-expression issue has something to do with the Twisted reactor. The Twisted docs are quite intimidating to newbies like me...
I have to crawl the following URL, which basically contains reviews. Every review there has a reviewer name, a title, and the review text itself.
I've chosen "python-scrapy" to do this task.
But the URL mentioned is not the start URL; I will be obtaining it from the basic parse method. In parse I initialize an ItemLoader, extract a few fields there, and pass the items via the response's meta. (The extracted fields contain information such as hotel name, address, pricing, etc.)
I have also declared items, namely "review_member_name", "review_quote", "review_review"
There is more than one review on the page, and the review id for a review can be obtained from response.url (shown in the parse_review method below).
Since there are many reviews and all share the same item fields, the values get concatenated, which should not happen. Can anybody suggest a way to solve this?
Below is my source for parse_review:
def parse_review(self, response):
    review_nos = re.search(".*www\.tripadvisor\.in/ExpandedUserReviews-.*context=1&reviews=(.+)&servlet=Hotel_Review&expand=1", response.url).group(1)
    review_nos = review_nos.split(',')  # list of review ids

    for review_no in review_nos:
        item = response.meta['item']
        # item = ItemLoader(item=TripadvisorItem(), response=response) - this works fine but I will lose the items from the parse method
        div_id = "expanded_review_" + review_no
        review = response.xpath('/html/body/div[@id="%s"]' % div_id)

        member_name = review.xpath('.//div[@class="member_info"]//div[@class="username mo"]//text()').extract()
        if member_name:
            item.add_value('review_member_name', member_name)

        review_quote = review.xpath('.//div[@class="innerBubble"]/div[@class="quote"]//text()').extract()
        if review_quote:
            item.add_value('review_quote', review_quote)

        review_entry = review.xpath('.//div[@class="innerBubble"]/div[@class="entry"]//text()').extract()
        if review_entry:
            item.add_value('review_review', review_entry)

        yield item.load_item()
Following is my items.json ("review_review" has been removed, and so have the items from the parse method):
[{"review_quote": "\u201c Fabulous service \u201d", "review_member_name": "VimalPrakash"},
{"review_quote": "\u201c Fabulous service \u201d \u201c Indian hospitality at its best, and honestly the best coffee in India \u201d", "review_member_name": "VimalPrakash Jessica P"},
{"review_quote": "\u201c Fabulous service \u201d \u201c Indian hospitality at its best, and honestly the best coffee in India \u201d \u201c Nice hotel in a central location \u201d", "review_member_name": "VimalPrakash Jessica P VikInd"}]
And please suggest a good title for this question.
You'll have to create a new ItemLoader before doing add_value on it; now you're creating one item, and adding new values to it again and again in the loop.
for review_no in review_nos:
    item = ItemLoader(item=projectItem(), response=response)
    # ...
    yield item.load_item()
You can also use .add_xpath directly with the XPath you're supplying, and use response.xpath as the selector for the item when creating the ItemLoader; that way you can probably avoid all the if tests and let the loader do what it should do: load items.
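For example, a rough sketch of that approach applied to the loop from the question (the XPaths and field names are the ones from the question; the loader is built on each review's selector instead of on the response):

for review_no in review_nos:
    div_id = "expanded_review_" + review_no
    review = response.xpath('/html/body/div[@id="%s"]' % div_id)

    # A fresh loader per review, built on the review's own selector,
    # so values from different reviews are never concatenated.
    loader = ItemLoader(item=TripadvisorItem(), selector=review)
    loader.add_xpath('review_member_name',
                     './/div[@class="member_info"]//div[@class="username mo"]//text()')
    loader.add_xpath('review_quote',
                     './/div[@class="innerBubble"]/div[@class="quote"]//text()')
    loader.add_xpath('review_review',
                     './/div[@class="innerBubble"]/div[@class="entry"]//text()')
    yield loader.load_item()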