Parsing ajax responses to retrieve final url content in Scrapy? - python

I have the following problem:
My scraper starts at a "base" URL. This page contains a dropdown that creates another dropdown via ajax calls, and this cascades 2-3 times until it has all the information needed to get to the "final" page where the actual content I want to scrape is.
Rather than clicking things (and having to use Selenium or similar), I use the page's exposed JSON API to mimic this behavior: instead of clicking dropdowns, I simply send a request and read JSON responses that contain the array of information used to generate the next dropdown's contents, and do this until I have the final URL for one item. This URL takes me to the final item page that I want to actually parse.
I am confused about how to use Scrapy to get the "final" url for every combination of dropdown boxes. I wrote a crawler using urllib that used a ton of loops to just iterate through every combination of url, but Scrapy seems to be a bit different. I moved away from urllib and lxml because Scrapy seemed like a more maintainable solution, which is easier to integrate with Django projects.
Essentially, I am trying to force Scrapy to take a certain path that I generate along the way as I read the contents of the json responses, and only really parse the last page in the chain to get real content. It needs to do this for every possible page, and I would love to parallelize it so things are efficient (and use Tor, but these are later issues).
I hope I have explained this well, let me know if you have any questions. Thank you so much for your help!
Edit: Added an example
[base url]/?location=120&section=240
returns:
<departments>
<department id="62" abrev="SIG" name="name 1"/>
<department id="63" abrev="ENH" name="name 2"/>
<department id="64" abrev="GGTH" name="name 3"/>
...[more]
</departments>
Then I grab the department id and add it to the URL like so:
[base url]/?location=120&section=240&department_id=62
returns:
<courses>
<course id="1" name="name 1"/>
<course id="2" name="name 2"/>
</courses>
This continues until I end up with the actual link to the listing.
This is essentially what this looks like on the page (though in my case, there is a final "submit" button on the form that sends me to the actual listing that I want to parse):
http://roshanbh.com.np/dropdown/
So, I need some way of scraping every combination of the dropdowns so that I get all the possible listing pages. The intermediate step of walking the ajax xml responses to generate final listing URLs is messing me up.

You can use a chain of callback functions, starting from the main callback function. Say you're implementing a spider extending BaseSpider; write your parse function like this:
...
def parse(self, response):
    # other code
    yield Request(url=self.base_url, callback=self.first_dropdown)

def first_dropdown(self, response):
    ids = self.parse_first_response(response)  # code for parsing the first dropdown's contents
    for i in ids:
        req_url = response.url + "/?location=" + i
        yield Request(url=req_url, callback=self.second_dropdown)

def second_dropdown(self, response):
    ids = self.parse_second_response(response)  # code for parsing the second dropdown's contents
    for i in ids:
        req_url = response.url + "&section=" + i
        yield Request(url=req_url, callback=self.third_dropdown)
...
The last callback function will have the code needed to extract your data.
Be careful: you're asking to try every possible combination of inputs, and this can lead to a very high number of requests very quickly.
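For completeness, a minimal sketch of what that last part of the chain might look like; the parse_third_response helper, the department_id parameter and the selectors in parse_listing are placeholders, since the question doesn't show the structure of the final listing page:

def third_dropdown(self, response):
    ids = self.parse_third_response(response)  # hypothetical parser for the last dropdown's contents
    for i in ids:
        req_url = response.url + "&department_id=" + i
        yield Request(url=req_url, callback=self.parse_listing)

def parse_listing(self, response):
    # This is the only place where real content is extracted; adjust the selectors
    # to the actual listing page.
    yield {
        "title": response.css("h1::text").get(),
        "url": response.url,
    }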

Related

Unable to find a way to paginate through API data

I'm trying to use Python 3 requests.get to retrieve data from this page using its API. I'm interested in retrieving it using the data found here and saving the entire table into my own JSON.
Here's my attempt so far
import json
import requests

source = requests.get("https://www.mwebexplorer.com/api/mwebblocks").json()
with open('mweb.json', 'w') as json_file:
    json.dump(source, json_file)
I've looked through other questions regarding pagination, and in all the other cases it was possible to write for loops to iterate through all the pages, but in my specific case the link does not change when clicking next to go to the next page of data. I also can't use Scrapy's xpath method to click next, because the entire table and its pagination are not accessible through HTML or XML.
Is there something I can add to my requests.get to include the entire JSON of all pages of the table?
Depending on what browser you're using it might be different, but in Chrome I can go to the Network tab in DevTools and view the full details of the request. This reveals that it's actually a POST request, not a GET request. If you look at the payload, you can see a bunch of key-value pairs, including a start and a length.
So, try something like
requests.post("https://www.mwebexplorer.com/api/mwebblocks", data={"start": "50", "length": "50"})
or similar. You might need to include the other parts of the form data, depending on the response you get.
Keep in mind that sites frequently don't like it when you try to scrape them like this.
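If the POST works, a simple way to page through everything is to loop over the start offset until the API stops returning rows, then dump the combined result. This is only a sketch: the start and length keys mirror the observed form data, but the shape of the JSON response (e.g. a "data" list) is an assumption you'd have to verify in the network tab.

import json
import requests

url = "https://www.mwebexplorer.com/api/mwebblocks"
all_rows = []
start, length = 0, 50

while True:
    # 'start' and 'length' mirror the form data seen in the browser's network tab
    payload = {"start": str(start), "length": str(length)}
    resp = requests.post(url, data=payload).json()
    rows = resp.get("data", [])  # assumed key; check the actual response structure
    if not rows:
        break
    all_rows.extend(rows)
    start += length

with open('mweb.json', 'w') as json_file:
    json.dump(all_rows, json_file)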

Passing URL to parse in Scrapy Spider URL is captured using Scrapy-Selenium

I am trying to scrape a website which has some dropdowns, so I planned to use the Scrapy framework with Scrapy-Selenium (more here) to click around the dropdowns (nested for loop), then capture the URL using the code below and pass it to the parse() function to look for the needed data and scrape it into a MySQL database.
now_url = self.driver.current_url
print('Current URL is: ' + now_url)
yield Request(now_url, callback=self.parse)

def parse(self, response):
    # This function will loop through each page and capture the data sets available
    # on each page of medicine.
    # Creating items to be stored in the items.py file with this crawler:
    items = GrxItem()
    # Loop over the items on each medicine page (from a-z), add them to items and
    # send them through the pipelines to the SQL DB.
But the logic doesn't seem to work as expected. Any insight on how to deal with this is appreciated. The full code is here.
EDIT:
I tried using SeleniumRequest() as well, but that doesn't seem to work either.
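For reference, the usual scrapy-selenium pattern is to yield a SeleniumRequest and read the rendered driver from the response; below is only a rough sketch of the flow described in the question, with a placeholder start URL, placeholder dropdown handling and a dict standing in for GrxItem:

from scrapy import Request, Spider
from scrapy_selenium import SeleniumRequest


class MedicineSpider(Spider):
    name = "medicine"

    def start_requests(self):
        # SeleniumRequest renders the page in the browser driver before the callback runs
        yield SeleniumRequest(url="https://example.com/medicines",
                              callback=self.click_dropdowns, wait_time=10)

    def click_dropdowns(self, response):
        driver = response.request.meta["driver"]
        # ... click through the nested dropdowns with the driver here ...
        now_url = driver.current_url
        yield Request(now_url, callback=self.parse)

    def parse(self, response):
        # Build the project's GrxItem here and send it through the pipeline to MySQL;
        # the dict below is just a stand-in for the real extraction.
        yield {"url": response.url}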

Scrapy identify redirect and stop for loop

I'm trying to iterate over some pages. The different pages are marked with -or10, -or20, -or30, etc. in the URL, i.e.
/Restaurant_Review
is the first page,
/Restaurant_Review-or10
is the second page,
/Restaurant_Review-or20
is the third page, etc.
The problem is that I get redirected from those pages to the normal URL (the first one) if the -or- version doesn't exist. I'm currently looping over a range in a for loop and dynamically changing the -or- value.
def parse(self, response):
    l = range(100)
    reviewRange = l[10::10]
    for x in reviewRange:
        yield Request(url + "-or" + str(x), callback=self.parse_page)

def parse_page(self, response):
    # do something
    # How can I, from here, tell the for loop to stop?
    if oldurl == response.url:
        return break  # this doesn't work
The problem is that I need to make the request even if the page doesn't exist, and this is not scalable. I've tried comparing the URLs, but I still don't understand how I can return something from the parse_page() function that would tell the parse() function to stop.
You can check what is in response.meta.get('redirect_urls'), for example. In case you have something there, retry the original URL with dont_filter.
Or try to catch such cases with RetryMiddleware.
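For example, instead of yielding the whole range of offsets up front, you can schedule the pages one at a time and only request the next -orN offset when the current response was not redirected. A rough sketch (the base_url/offset meta keys are made up for this example, and scrapy's Request is assumed to be imported):

def parse(self, response):
    base_url = response.url
    # start with the first paginated offset and walk forward one page at a time
    yield Request(base_url + "-or10", callback=self.parse_page,
                  meta={"base_url": base_url, "offset": 10})

def parse_page(self, response):
    # RedirectMiddleware records any redirect chain in response.meta['redirect_urls']
    if response.meta.get("redirect_urls"):
        # We were bounced back to the first page, so this offset doesn't exist: stop here.
        return
    # ... extract the data on this page ...
    base_url = response.meta["base_url"]
    offset = response.meta["offset"] + 10
    yield Request(base_url + "-or" + str(offset), callback=self.parse_page,
                  meta={"base_url": base_url, "offset": offset})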
This is not an answer to the actual question, but rather an alternative solution that does not require redirect detection.
In the HTML you can already find all those pagination URLs by using:
response.css('.pageNum::attr(href)').getall()
Regarding @Anton's question in a comment about how I got this:
You can check this by opening a random restaurant review page with the Scrapy shell:
scrapy shell "https://www.tripadvisor.co.za/Restaurant_Review-g32655-d348825-Reviews-Brent_s_Delicatessen_Restaurant-Los_Angeles_California.html"
Inside the shell you can view the received HTML in your browser with:
view(response)
There you'll see that it includes the HTML (and that specific class) for the pagination links. The real website does use JavaScript to render the next page, but it does so by retrieving the full HTML for the next page based on the URL. Basically, it just replaces the entire page; there's very little additional processing involved. So this means that if you open the link yourself you get the full HTML too. Hence, the JavaScript issue is irrelevant here.
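Used inside a callback, that might look like this (a sketch; the review extraction itself is omitted):

def parse(self, response):
    # ... extract the reviews on the current page ...
    # Follow every pagination link found in the HTML; Scrapy's duplicate filter
    # takes care of pages that have already been requested.
    for href in response.css('.pageNum::attr(href)').getall():
        yield response.follow(href, callback=self.parse)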

Scrapy - Build URLs Dynamically Based on HTTP Status Code?

I'm just getting started with Scrapy and I went through the tutorial, but I'm running into an issue: either I can't find the answer in the tutorial and/or docs, or I've read the answer multiple times now and I'm just not understanding it properly...
Scenario:
Let's say I have exactly 1 website that I would like to crawl. Content is rendered dynamically based on query params passed in the URL. I will need to scrape 3 "sets" of data based on the URL param "category".
All the information I need can be grabbed from common base URLs like this:
"http://shop.somesite.com/browse/?product_type=instruments"
And the URLs for each category look like so:
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums"
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=keyboards"
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=guitars"
The one caveat here is that the site only loads 30 results per initial request. If the user wants to view more, they have to click the "Load More Results..." button at the bottom. After investigating this a bit, I found that during the initial load of the page only the request for the top 30 is made (which makes sense), and after clicking the "Load More..." button, the URL is updated with "pagex=2" appended and the container refreshes with 30 more results. After this, the button goes away and, as the user continues to scroll down the page, subsequent requests are made to the server to get the next 30 results: the "pagex" value is incremented by one, the container is refreshed with the results appended, rinse and repeat.
I'm not exactly sure how to handle pagination on sites, but the simplest solution I came up with is simply finding out the max "pagex" number for each category and just setting the URLs to that number for starters.
For example, if you pass URL in browser:
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums&pagex=22"
An HTTP 200 response code is received and all results are rendered to the page. Great! That gives me what I need!
But say that next week or so 50 more items are added, so now the max is "...pagex=24"; I wouldn't get all the latest items.
Or if 50 items are removed and the new max is "...pagex=20", I will get a 404 response when requesting "22".
I would like to send a test request with the last known "good" max page number and, based on the HTTP response, decide what the URL should be.
So, before I start any crawling, I would like to add 1 to "pagex" and check for a 404. If I get a 404 I know I'm still good; if I get a 200, I need to keep adding 1 until I get a 404, so I know where the max is (or decrease if needed).
I can't seem to figure out if I can do this using Scrapy, or if I have to use a different module to run this check first. I tried adding simple checks for testing purposes in the "parse" and "start_requests" methods, with no luck. start_requests doesn't seem to be able to handle responses, and parse can check the response code but will not update the URL as instructed.
I'm sure it's my poor coding skills (still new to this all), but I can't seem to find a viable solution....
Any thoughts or ideas are very much appreciated!
You can configure in Scrapy which HTTP statuses to handle; that way you can make decisions, for example in the parse method, according to response.status. Check how to handle statuses in the documentation. Example:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [404]
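Building on that, one way to find the real maximum pagex is to let 404 responses reach your callback and walk upward from the last known good page before starting the real crawl. This is only a sketch (a plain Spider instead of CrawlSpider, with placeholder base URL and last known page number), and it only covers the case where the category grew; a shrinking one would need a similar walk downward:

import scrapy


class ShopSpider(scrapy.Spider):
    name = "shop"
    handle_httpstatus_list = [404]   # let 404 responses reach the callbacks
    base_url = "http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums&pagex="
    last_known_max = 22              # placeholder for the last known good page

    def start_requests(self):
        # Probe one page past the last known maximum
        page = self.last_known_max + 1
        yield scrapy.Request(self.base_url + str(page), callback=self.probe,
                             meta={"page": page})

    def probe(self, response):
        page = response.meta["page"]
        if response.status == 200:
            # The category has grown: keep probing upward
            yield scrapy.Request(self.base_url + str(page + 1), callback=self.probe,
                                 meta={"page": page + 1})
        else:
            # First 404: the previous page is the current maximum, start the real crawl there
            yield scrapy.Request(self.base_url + str(page - 1), callback=self.parse)

    def parse(self, response):
        # ... scrape the fully loaded category page here ...
        pass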

Scrapy: Get data on page and following link

I have been using scrapy for a personal project. My problem is very similar to the question asked on the following page:
Scrapy: Follow link to get additional Item data?
The page I am scraping is the following:
http://www.tennisinsight.com/player_activity.php?player_id=51
This page has a list of matches in this form for eg:
Round of 16 Def. Ivan Dodig(37,2.41) (CRO) 6-3 6-3 Recap Match Stats $1.043
I have currently written Scrapy code that opens every link on the page which has the "Match Stats" link, and scrapes data on that page into an individual record.
In addition to this, I want to scrape the "Odds" column (which is the $1.043 above) and add this data to the record.
I have searched for an answer and it seems that I have to use the Request meta field and pass this data along to the parse method. However, I have a problem because I am struggling to incorporate it into my code. The answer from the stackoverflow link I linked above is "To scrape additional fields which are on other pages, in a parse method extract URL of the page with additional info, create and return from that parse method a Request object with that URL and pass already extracted data via its meta parameter."
This makes perfect sense; however, the URLs that I scrape come from the rules, so I don't know how to extract the required data.
Here is part of my code so far which will hopefully better explain my problem.
rules = (
    Rule(SgmlLinkExtractor(allow=r"match_stats_popup.php\?matchID=\d+",
                           restrict_xpaths='//td[@class="matchStyle"]',
                           tags='a', attrs='href', process_value=getPopupLink),
         callback='parse_match', follow=True),
)
The parse_match function parses the match stats into one item.
So what happens is that each of these match stats links is opened up, and there is no way for me to access the main page's Odds column.
Any help will be much appreciated.
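For what it's worth, a rough sketch of the meta approach quoted above, applied here by parsing the activity page directly instead of relying on the Rule: extract the odds together with each match-stats link and hand the odds to parse_match through the request's meta. The row/odds XPaths and the item handling are placeholders; getPopupLink is the helper already used in the rules:

def parse_activity(self, response):
    # One table row per match; the XPaths and the 'oddsStyle' class are placeholders.
    for row in response.xpath('//tr[td[@class="matchStyle"]]'):
        stats_href = row.xpath('.//td[@class="matchStyle"]/a/@href').extract_first()
        odds = row.xpath('.//td[@class="oddsStyle"]/text()').extract_first()
        if stats_href:
            yield Request(getPopupLink(stats_href), callback=self.parse_match,
                          meta={'odds': odds})

def parse_match(self, response):
    item = {}  # or the item class already used for the match stats
    # ... existing match-stats extraction ...
    item['odds'] = response.meta.get('odds')
    yield item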
