I'm just getting started with Scrapy and I went through the tutorial, but I'm running into an issue that either isn't covered in the tutorial and/or docs, or I've read the answer multiple times now and I'm just not understanding it properly...
Scenario:
Let's say I have exactly 1 website that I would like to crawl. Content is rendered dynamically based on query params passed in the URL. I will need to scrape 3 "sets" of data based on the URL param "category".
All the information I need can be grabbed from common base URLs like this:
"http://shop.somesite.com/browse/?product_type=instruments"
And the URLs for each category like so:
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums"
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=keyboards"
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=guitars"
The one caveat here is that the site only loads 30 results per initial request. If the user wants to view more, they have to click the "Load More Results..." button at the bottom. After investigating this a bit: during the initial page load, only the request for the top 30 is made (which makes sense), and after clicking the "Load More..." button, the URL is updated with "pagex=2" appended and the container refreshes with 30 more results. After this, the button goes away and, as the user continues to scroll down the page, subsequent requests are made to the server for the next 30 results: the "pagex" value is incremented by one, the container is refreshed with the results appended, rinse and repeat.
I'm not exactly sure how to handle pagination on sites, but the simplest solution I came up with is simply finding out what the max number "pagex" for each category, and just set the URLs to that number for starters.
For example, if you pass URL in browser:
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums&pagex=22"
HTTP Response Code 200 is received and all results are rendered to page. Great! That gives me what I need!
But say next week or so, 50 more items are added, so now the max is "...pagex=24" and I wouldn't get all the latest.
Or if 50 items are removed and the new max is "...pagex=20", I will get a 404 response when requesting "22".
I would like to send a test request with the last known "good" max page number and, based on the HTTP response, decide what the URL will be.
So, before I start any crawling, I would like to add 1 to "pagex" and check for a 404. If I get a 404, I know I'm still good; if I get a 200, I need to keep adding 1 until I get a 404, so I know where the max is (or decrease if needed).
I can't seem to figure out if I can do this using Scrapy, or if I have to use a different module to run this check first. I tried adding simple checks for testing purposes in the "parse" and "start_requests" methods, with no luck. start_requests doesn't seem to be able to handle responses, and parse can check the response code but will not update the URL as instructed.
I'm sure it's my poor coding skills (still new to this all), but I can't seem to find a viable solution....
Any thoughts or ideas are very much appreciated!
You can configure which HTTP statuses Scrapy hands to your spider, so you can make decisions, for example in the parse method, according to response.status. Check how to handle statuses in the documentation. Example:
class MySpider(CrawlSpider):
handle_httpstatus_list = [404]
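If you want to run the probe before the crawl even starts (as the question describes), it doesn't have to be Scrapy at all. Here's a rough stdlib sketch; the URL pattern is taken from the question, the function names are made up, and it assumes the site really does return 404 past the last page:

```python
import urllib.error
import urllib.request

BASE = "http://shop.somesite.com/browse/?q=&product_type=instruments"

def category_url(category, pagex):
    # Build the paginated category URL from the question.
    return f"{BASE}&category={category}&pagex={pagex}"

def page_exists(url):
    # Naive GET-based existence check: 200 -> exists, 404 -> past the end.
    try:
        urllib.request.urlopen(url)
        return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise

def find_max_page(category, last_known_max):
    # Probe upward from the last known good max, or downward if it shrank.
    page = last_known_max
    if page_exists(category_url(category, page)):
        while page_exists(category_url(category, page + 1)):
            page += 1
    else:
        while page > 1 and not page_exists(category_url(category, page)):
            page -= 1
    return page
```

You'd run find_max_page once per category, then generate the start URLs for the actual Scrapy crawl from the result.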
I need your help because, for the first time, I'm having problems getting some information with BeautifulSoup.
I have two problems on this page
The green button GET COUPON CODE appears after a few moments (see the GIF capture).
When we inspect the button link, we find a simple href attribute that calls an out.php script, which opens the destination link that I am trying to capture.
Thank you for your help
Your problem is a little unclear but if I understand correctly, your first problem is that the 'get coupon code' button looks like this when you render the HTML that you receive from the original page request.
The mark-up for a lot of this code is rendered dynamically using javascript. So that button is missing its href value until it gets loaded in later. You would need to also run the javascript on that page to render this after the initial request. You can't really get this easily using just the python requests library and BeautifulSoup. It will be a lot easier if you use Selenium too which lets you control a browser so it runs all that javascript for you and then you can just get the button info a couple of seconds after loading the page.
There is a way to do this all with plain requests, but it's a bit tedious. You would need to read through the requests the page makes and figure out which one fetches the link for the button. The upside is that it would cut the number of steps, and the time, it takes to get the info you need. You could then just use that request every time to get the right PHP link and pull the info from there.
For your second point, I'm not sure if I've already answered it, but maybe you're also trying to get the redirect link from that PHP link. From inspecting the network requests, it looks like the info is in the response headers; there is no body to inspect.
(I know it says 'from cache' but the point is that the redirect is being caused by the header info)
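To read that Location header yourself, you need to stop urllib from following the redirect so the 30x response surfaces instead of the final page. A small sketch (the out.php URL and function name are placeholders, not from the actual site):

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None tells urllib not to follow the redirect, so the
    # 30x response is raised as an HTTPError we can inspect.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def resolve_coupon_link(out_php_url):
    opener = urllib.request.build_opener(NoRedirect)
    try:
        opener.open(out_php_url)
    except urllib.error.HTTPError as e:
        # The destination the button would open is in the headers.
        return e.headers.get("Location")
    return None
```

The same idea with the requests library would be `requests.get(url, allow_redirects=False)` and then reading `response.headers["Location"]`.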
Web-scraping adjacent question about URLs acting whacky.
If I go to Glassdoor job search and enter 6 fields (Austin, "engineering manager", full-time, exact city, etc.), I get a results page with 38 results. This is the link I get. Ideally I'd like to save this link with its search criteria and reference it later.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00
However, if I copy that exact link and paste it into a new tab, it doesn't act as desired.
It redirects to this different link, maintaining some of the criteria but losing the location criteria, bringing up thousands of results from around the country instead of just Austin.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&fromAge=30&radius=0&minRating=4.0
I understand I could use selenium to select all 6 fields, I'd just like to understand what's going on here and know if there is a solution involving just using a URL.
The change of URL seems to happen on the server that is handling the request. I would think this is how the server-side endpoint is configured: it trims out extra parameters and redirects you to another URL. There's nothing you can do about this, since however you pass it, it will always resolve into the second URL format.
I have also tried URL shortener but the same behavior persists.
The only way around this is to use automation such as Selenium to select the same fields and display the results from the first URL.
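One way to see exactly which criteria the redirect drops is to diff the query parameters of the two URLs with the stdlib. This sketch just compares the two links given above; note that the first URL contains a second "?" before jobType, so jobType gets folded into locT's value instead of being parsed as its own parameter, which may be part of why the server mishandles it:

```python
from urllib.parse import parse_qs, urlsplit

original = ("https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22"
            "&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime"
            "&fromAge=30&radius=0&minRating=4.00")
redirected = ("https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22"
              "&fromAge=30&radius=0&minRating=4.0")

def query_params(url):
    # Map each parameter name to its list of values.
    return parse_qs(urlsplit(url).query)

# Which parameters did the redirect throw away?
lost = set(query_params(original)) - set(query_params(redirected))
print(sorted(lost))  # → ['locId', 'locT', 'sc.locationSeoString']
```

All three lost parameters are the location criteria, which matches the observed behavior of getting nationwide results.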
I'm trying to iterate over some pages. The different pages are marked with or10, or20, or30, etc. in the URL, i.e.
/Restaurant_Review
is the first page
/Restaurant_Review-or10
Is the second page
/Restaurant_Review-or20
3rd page etc.
The problem is that I get redirected from those pages to the normal URL (the first one) if the -or- version doesn't exist. I'm currently looping over a range in a for loop and dynamically changing the -or- value.
def parse(self, response):
    # Pages are offset in steps of 10: -or10, -or20, ... -or90
    for x in range(10, 100, 10):
        yield scrapy.Request(url + "-or" + str(x), callback=self.parse_page)

def parse_page(self, response):
    # do something
    # How can I, from here, tell the for loop in parse() to stop?
    if oldurl == response.url:
        return  # this doesn't work
The problem is that I end up making the request even if the page doesn't exist, and this is not scalable. I've tried comparing the URLs, but I still don't understand how I can return something from parse_page() that tells parse() to stop.
You can check what is in response.meta.get('redirect_urls'), for example. In case you have something there, retry original url with dont_filter.
Or try to catch such cases with RetryMiddleware.
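A sketch of what that check could look like in the callback (method names follow the question; the retry-with-dont_filter option from above is noted in the comments):

```python
def is_redirected(meta):
    # RedirectMiddleware records the chain of intermediate URLs here,
    # so a non-empty list means we were bounced somewhere else.
    return bool(meta.get("redirect_urls"))

def parse_page(self, response):
    if is_redirected(response.meta):
        original = response.meta["redirect_urls"][0]
        # We were redirected back to the base review page, so the -orNN
        # page we asked for doesn't exist. Stop paginating here, or, if
        # you suspect a transient redirect, retry the original URL once:
        #   yield scrapy.Request(original, callback=self.parse_page,
        #                        dont_filter=True)
        return
    # ... otherwise extract the reviews as usual ...
```

The key point is that the decision happens per response in the callback, rather than trying to break out of the original for loop.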
This is not an answer to the actual question, but rather an alternative solution that does not require redirect detection.
In the HTML you can already find all those pagination URLs by using:
response.css('.pageNum::attr(href)').getall()
Regarding #Anton's question in a comment about how I got this:
You can check this by opening a random restaurant review page with the Scrapy shell:
scrapy shell "https://www.tripadvisor.co.za/Restaurant_Review-g32655-d348825-Reviews-Brent_s_Delicatessen_Restaurant-Los_Angeles_California.html"
Inside the shell you can view the received HTML in your browser with:
view(response)
There you'll see that it includes the HTML (and that specific class) for the pagination links. The real website does use JavaScript to render the next page, but it does so by retrieving the full HTML for the next page based on the URL. Basically, it just replaces the entire page; there's very little additional processing involved. So this means if you open the link yourself, you get the full HTML too. Hence, the JavaScript issue is irrelevant here.
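Putting that selector to use might look like the sketch below. One caveat: pagination links often appear at both the top and bottom of a page, so the selector can return duplicates; Scrapy's dupefilter would catch repeats anyway, but a small order-preserving dedupe (a hypothetical helper, not part of Scrapy) keeps the request count honest:

```python
def unique_in_order(urls):
    # Drop duplicates while keeping the first occurrence of each URL.
    seen = set()
    return [u for u in urls if not (u in seen or seen.add(u))]

def parse(self, response):
    # All pagination hrefs are already present in the initial HTML.
    pages = unique_in_order(response.css(".pageNum::attr(href)").getall())
    for href in pages:
        # response.follow resolves relative hrefs against the page URL.
        yield response.follow(href, callback=self.parse_page)
```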
I have a search engine in production serving around 700,000 URLs. The crawling is done using Scrapy, and all spiders are scheduled using DeltaFetch in order to get new links daily.
The difficulty I'm facing is handling broken links.
I have a hard time finding a good way to periodically scan, and remove broken links. I was thinking about a few solutions :
Developing a Python script using requests.get to check every single URL, and delete anything that returns a 404 status.
Using a third-party tool like https://github.com/linkchecker/linkchecker, but I'm not sure it's the best solution since I only need to check a list of URLs, not a whole website.
Using a Scrapy spider to scrape this URL list and return any URLs that are erroring out. I'm not really confident on that one since I know Scrapy tends to time out when scanning a lot of URLs on different domains; this is why I rely so much on DeltaFetch.
Do you have any recommendations / best practice to solve this problem?
Thanks a lot.
Edit: I forgot to give one precision: I'm looking to "validate" those 700k URLs, not to crawl them. Actually, those 700k URLs are the crawling result of around 2,500k domains.
You could write a small script that just checks the returned HTTP status, like so:
import urllib.error
import urllib.request

for url in urls:
    try:
        urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        # Do something when the request fails
        print(e.code)
This would be the same as your first point. You could also run this async in order to optimize the time it takes to run through your 700k links.
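A sketch of that async variant using a thread pool (thread count and the "what counts as dead" rule are my assumptions, tune both for your setup):

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def status_of(url):
    # Return (url, status); None stands for network-level failures.
    try:
        return url, urllib.request.urlopen(url, timeout=10).status
    except urllib.error.HTTPError as e:
        return url, e.code
    except urllib.error.URLError:
        return url, None  # DNS failure, refused connection, timeout...

def is_dead(status):
    # Only a definite 404/410 marks a link as dead; transient errors
    # (5xx, timeouts) should not delete a URL on the first pass.
    return status in (404, 410)

def check_all(urls, workers=50):
    # Yield the URLs that look definitively broken.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, status in pool.map(status_of, urls):
            if is_dead(status):
                yield url
```

For 700k links you'd feed check_all in batches and write results out incrementally rather than holding everything in memory.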
I would suggest using Scrapy, since you're already looking up each URL with this tool and thus know which URLs error out. This means you don't have to check the URLs a second time.
I'd go about it like this:
Save every URL that errors out in a separate list/map with a counter (which is stored between runs).
Every time a URL errors out, increment its counter. If it doesn't, decrement the counter.
After running the Scrapy script, check this list/map for URLs with a high enough counter, let's say more than 10 faults, and remove them, or store them in a separate list of links to check up on at a later time (as a safeguard in case you accidentally removed a working URL because its server was down too long).
Since your third bullet is concerned about Scrapy being shaky with URL results, the same could be said for websites in general. If a site errors out on 1 try, it might not mean a broken link.
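The counter idea above could be sketched like this (class name, file format, and threshold are all made up for illustration; any persistent store would do):

```python
import json

class FailureTracker:
    """Failure counter persisted between runs, as described above."""

    def __init__(self, path="failures.json", threshold=10):
        self.path = path
        self.threshold = threshold
        try:
            with open(path) as f:
                self.counts = json.load(f)
        except FileNotFoundError:
            self.counts = {}

    def record(self, url, ok):
        if ok:
            # Decrement on success, never below zero.
            self.counts[url] = max(0, self.counts.get(url, 0) - 1)
        else:
            self.counts[url] = self.counts.get(url, 0) + 1

    def to_remove(self):
        # URLs that failed persistently enough to be considered broken.
        return [u for u, n in self.counts.items() if n > self.threshold]

    def save(self):
        with open(self.path, "w") as f:
            json.dump(self.counts, f)
```

In a Scrapy setup, record() would be called from the spider's errback and the item pipeline, and save() at spider close.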
If you go for creating a script of your own, check this solution.
In addition, an optimization I suggest is to build a hierarchy in your URL repository. If you get a 404 from a parent URL, you can avoid checking all its child URLs.
The first thought that came to my mind is to request the URLs with HEAD instead of any other method.
Spawn multiple spiders at once, assigning them batches like LIMIT 0,10000 and LIMIT 10000,10000.
In your data pipeline, instead of running a MySQL DELETE query each time the scraper finds a 404 status, run a bulk DELETE FROM table WHERE link IN(link1,link2) query.
I am sure you have an INDEX on the link column; if not, add it.
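The first and third points might look like this sketch (table and column names are assumed from the answer; beware that some servers answer HEAD incorrectly, so a GET fallback on 405 can be worth adding):

```python
import urllib.error
import urllib.request

def head_status(url):
    # HEAD transfers no response body, so checking 700k links is much
    # cheaper than full GETs.
    req = urllib.request.Request(url, method="HEAD")
    try:
        return urllib.request.urlopen(req, timeout=10).status
    except urllib.error.HTTPError as e:
        return e.code

def bulk_delete_sql(table, n):
    # One parameterized DELETE ... IN (...) per batch instead of one
    # DELETE per dead link; pass the links as query parameters.
    placeholders = ", ".join(["%s"] * n)
    return f"DELETE FROM {table} WHERE link IN ({placeholders})"

# e.g. cursor.execute(bulk_delete_sql("links", len(batch)), batch)
```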
If you visit this link right now, you will probably get a VBScript error.
On the other hand, if you visit this link first and then the above link (in the same session), the page comes through.
The way this application is set up, the first page is meant to serve as a frame in the second (main) page. If you click around a bit, you'll see how it works.
My question: How do I scrape the first page with Python? I've tried everything I can think of -- urllib, urllib2, mechanize -- and all I get is 500 errors or timeouts.
I suspect the answers lies with mechanize, but my mechanize-fu isn't good enough to crack this. Can anyone help?
It always comes down to the request/response model. You just have to craft a series of http requests such that you get the desired responses. In this case, you also need the server to treat each request as part of the same session. To do that, you need to figure out how the server is tracking sessions. It could be a number of things, from cookies to hidden inputs to form actions, post data, or query strings. If I had to guess I'd put my money on a cookie in this case (I haven't checked the links). If this holds true, you need to send the first request, save the cookie you get back, and then send that cookie along with the 2nd request.
It could also be that the initial page has buttons and links that get you to the second page. Those links will have something like <A href="http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b="> where a lot of the gobbledygook is generated by the first page.
The "Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b=" part encodes some session information that you must get from the first page.
And, of course, you might even need to do both.
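If the cookie guess is right, the stdlib can handle it without mechanize: route both requests through one opener that shares a cookie jar, so the session cookie set by the frame page is sent with the main-page request. A sketch (frame_url and main_url stand in for the two links in the question):

```python
import http.cookiejar
import urllib.request

def make_session_opener():
    # All requests made through this opener share one cookie jar,
    # which is typically what ASP-style session tracking needs.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

# Hypothetical flow: fetch the frame page first so the server sets its
# session cookie, then fetch the main page through the same opener.
# opener, jar = make_session_opener()
# opener.open(frame_url).read()
# page = opener.open(main_url).read()
```

If the session state turns out to live in the query string instead, you'd scrape those generated parameters out of the first page's HTML and build the second URL from them.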
You might also try BeautifulSoup in addition to Mechanize. I'm not positive, but you should be able to parse the DOM down into the framed page.
I also find Tamper Data to be a rather useful plugin when I'm writing scrapers.