How to request a portion of a webpage with Scrapy (Python)

I am a bit new to web scraping and my question might be a bit silly. I want to get information from a rental website. I want to scrape almost 2000 pages per day to obtain the information, but I do not want to hammer their website. I just need the information inside a specific tag, which is a table. Is there any way to request only that part of the page rather than getting the whole page?
I will surely add delays and sleeps to the script, but reducing the file size would also help.
Implementing that would reduce the requested file size from around 300 kB to 11 kB.
Website URL: https://asunnot.oikotie.fi/vuokrattavat-asunnot
example of webpage: https://asunnot.oikotie.fi/vuokrattavat-asunnot/imatra/15733776
required tag: <div class="listing-details-container">...</div>
Thank you for your response in advance :)

I think 2000 requests a day is not high; it depends on when you do it. If you put a 10-second wait between each request, that should not overload the site, but it would take about 6 hours.
It may be better to do it overnight when the site should be quieter.
If you do 2000 requests with no wait, the site owner may be unhappy.
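As far as I know you can't ask the server for just the listing-details-container div: a normal GET returns the whole HTML document, so the saving you describe isn't possible at the request level. What you can do is throttle the spider and only parse that container. A minimal sketch, using the example URL and tag from the question and the 10-second delay suggested above:

import scrapy

class OikotieSpider(scrapy.Spider):
    name = "oikotie"
    start_urls = [
        # example listing page from the question
        "https://asunnot.oikotie.fi/vuokrattavat-asunnot/imatra/15733776",
    ]
    custom_settings = {
        "DOWNLOAD_DELAY": 10,              # ~10 s between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # spread requests out a little
        "CONCURRENT_REQUESTS": 1,
    }

    def parse(self, response):
        # the full page is still downloaded; we just parse the one container
        details = response.css("div.listing-details-container")
        yield {
            "url": response.url,
            "details_html": details.get(),
        }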

Related

Impossible to recover some information with BeautifulSoup on a site

I need your help because, for the first time, I am having problems getting some information with BeautifulSoup.
I have two problems on this page:
The green button GET COUPON CODE appears only after a few moments (see the GIF capture).
When we inspect the button link, we find a simple href attribute that calls an out.php script, which opens the destination link that I am trying to capture.
GET COUPON CODE
Thank you for your help
Your problem is a little unclear, but if I understand correctly, your first problem is that the 'GET COUPON CODE' button looks like this when you render the HTML that you receive from the original page request.
The markup for a lot of this page is rendered dynamically using JavaScript, so that button is missing its href value until it gets filled in later. You would need to also run the JavaScript on that page to render it after the initial request, and you can't really do that easily using just the Python requests library and BeautifulSoup. It will be a lot easier if you use Selenium, which lets you control a browser so it runs all that JavaScript for you; then you can just get the button info a couple of seconds after loading the page.
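A minimal Selenium sketch of that idea; the URL is a placeholder and I'm assuming the button can be located by its link text:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/coupon-page")  # placeholder for the real page

# wait until the JavaScript has rendered the button and its link
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "GET COUPON CODE"))
)
print(button.get_attribute("href"))
driver.quit()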
There is a way to do all of this with plain requests, but it's a bit tedious. You would need to read through the requests the page makes and figure out which one returns the link for the button. The upside is that it would cut both the number of steps and the amount of time it takes to get the info you need. You could then use that request every time to get the right PHP link and grab the info from there.
For your second point, I'm not sure if I've already answered it, but maybe you're also trying to get the redirect link from that PHP link. From inspecting the network requests, it looks like the info will be found in the response headers; there is no body to inspect.
(I know it says 'from cache' but the point is that the redirect is being caused by the header info)
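If you do go the plain-requests route and already have the out.php URL, a sketch of reading the redirect target from the headers (the URL here is hypothetical):

import requests

resp = requests.get(
    "https://example.com/out.php?id=123",  # hypothetical out.php link
    allow_redirects=False,                 # keep the redirect response itself
)
print(resp.status_code)                    # usually 301/302
print(resp.headers.get("Location"))        # the destination link you want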

scraping text rendered into a svg graphic (to deter scrapers) - how to?

So this time in my scraping escapades I've encountered a new foe: a website which deters scrapers by "transforming" the price data everyone would like to scrape into SVG images. A simple question: what is the "preferred" tool or method for scraping such a site continuously? I thought of downloading full-page screenshots with Selenium (with stealth, since the site also has Cloudflare scraping detection) and OCR'ing them with Tesseract, but the download alone takes about 7 seconds per page (and I have 180 of them to scrape), so while that isn't completely unworkable, it is below expectations, so to speak.
My question is: what are the general methods, techniques or tools I should be looking at to tackle this task? Is there a way of OCR'ing the SVGs directly on the site, without having to download them or make screenshots? Or what should I be looking at?
for reference, what I'm trying to scrape is for example this - https://www.goatbots.com/set/kaldheim , the "buy" and "sell" columns
You could try taking screenshots of the price elements only instead of taking a complete page screenshot. Check this post for partial screenshots.
As for OCR, Tesseract is the best free option.
For Cloudflare, use undetected-chromedriver for Python, which is quite successful at bypassing Cloudflare.
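A sketch combining those three suggestions: undetected-chromedriver, element-only screenshots, and Tesseract via pytesseract (the tesseract binary must be installed). The CSS selector for the price cells is an assumption; inspect the page to find the real one:

import io

import pytesseract
import undetected_chromedriver as uc
from PIL import Image
from selenium.webdriver.common.by import By

driver = uc.Chrome()
driver.get("https://www.goatbots.com/set/kaldheim")

for cell in driver.find_elements(By.CSS_SELECTOR, "td.price"):  # hypothetical selector
    png = cell.screenshot_as_png  # screenshot of just this element, not the whole page
    text = pytesseract.image_to_string(Image.open(io.BytesIO(png)))
    print(text.strip())

driver.quit()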

Is there a way to monitor a page 24/7 and when there is an update, load the new content

I would like to load/check the new content that has been loaded to a section of a page. Some pages update all the time, but the section that I want updates only once every couple of hours, or once a minute; no one knows when new content will be uploaded to that section, and it can happen 24/7. What I want to accomplish is: whenever there is new content uploaded to that section, do something immediately (in this case, go into the link and load the page). The only thing I can think of so far is checking that section of the page as frequently as possible, i.e. every 30 seconds or every minute. However, there are thousands of pages (roughly 6000) that I want to check, so I don't think this is an ideal way to do it, let alone whether it's even possible at the frequency I want.
I'm just wondering if there is a way to do it without asking my bot to scrape every single page every minute?
Nope, there is no magic spell here. Web pages do not have a "notification" option. If you want the info, you'll need to poll for the info. Yes, it's going to be wasteful, which is why you should ask yourself why you are doing this.
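If you do end up polling, the usual approach is to fetch the page on a schedule, hash only the section you care about, and react when the hash changes. A minimal sketch for a single page; the URL, selector and interval are placeholders:

import hashlib
import time

import requests
from bs4 import BeautifulSoup

last_hash = None
while True:
    html = requests.get("https://example.com/page").text  # placeholder URL
    section = BeautifulSoup(html, "html.parser").select_one("#watched-section")
    digest = hashlib.sha256(section.get_text().encode()).hexdigest() if section else None
    if last_hash is not None and digest != last_hash:
        print("Section changed: go into the link and load the page here")
    last_hash = digest
    time.sleep(60)  # poll once a minute; scale this out carefully for ~6000 pages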

How to optimize web site / make it load faster?

I have a webpage which does web scraping and displays news in a slideshow. It also extracts tweets from Twitter using tweepy.
The code sequence is below:
class extract_news:
    def bcnews(self):
        # code to extract news
        ...

    def func2(self):
        # code to extract news
        ...

    # ... more extractor functions like the above ...

    def extractfromtwitter(self):
        # code to extract using tweepy
        ...
I have multiple such functions to extract from different websites using BS4 and to display the news and tweets. I am using Flask to run this code.
But the page takes about 20 seconds or so to load, and if someone tries to access it remotely, it takes too long and the browser gives the error "Connection Timed Out" or just doesn't load.
How can I make this page load faster? Say, in under 5 seconds.
Thanks!
You need to identify the bottlenecks in your code and then figure out how to reduce them. It's difficult to help you with the minimal amount of code that you have provided, but the most likely cause is that each HTTP request takes most of the time, and the parsing is probably negligible in comparison.
See if you can figure out a way to parallelise the HTTP requests, e.g. using the multiprocessing or threading modules.
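For example, a sketch of fetching the pages concurrently with a thread pool (the URL list and worker count are placeholders):

import requests
from concurrent.futures import ThreadPoolExecutor

urls = [
    "https://example.com/news-source-1",  # placeholders for the real news sites
    "https://example.com/news-source-2",
]

def fetch(url):
    return requests.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))   # all requests run in parallel

# parse each page with BS4 afterwards (or inside fetch) as before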
I agree with the others: to give a concrete answer/solution we will need to see the code.
However, in a nutshell, what you will need to do is profile the application with your browser's DevTools. This will usually lead you to push the synchronous JavaScript below the CSS, markup, and text content so it doesn't block loading.
Also create a routine to load an initial chunk of content (approximately one page or slide's worth) so that the user has something to look at. The rest can load in the background, and they will never know the difference; it will almost certainly be available before they are able to click through to the next slide, even if it does take 10 seconds or so.
Perceived performance is what I am describing here. Yes, I agree, you can and should find ways to improve the overall loading time, but arguably more important is improving the "perceived performance". This is done, as I said, by loading some initial content and then streaming in the rest immediately afterwards.
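A rough Flask sketch of that idea: return the page shell immediately and let the browser pull the heavy scraped content from a second endpoint after the page has loaded. The scrape_all_news() helper and the inline template are placeholders, not the asker's actual code:

from flask import Flask, jsonify, render_template_string

app = Flask(__name__)

PAGE = """
<div id="slideshow">Loading news...</div>
<script>
  fetch("/api/news")
    .then(r => r.json())
    .then(items => { document.getElementById("slideshow").textContent = JSON.stringify(items); });
</script>
"""

def scrape_all_news():
    # placeholder for the slow BS4/tweepy extraction from the question
    return [{"title": "example item"}]

@app.route("/")
def index():
    return render_template_string(PAGE)  # returns instantly, nothing slow here

@app.route("/api/news")
def news():
    return jsonify(scrape_all_news())    # the slow work happens after the page is shown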

Scrapy - Build URLs Dynamically Based on HTTP Status Code?

I'm just getting started with Scrapy and I went through the tutorial, but I'm running into an issue that either I can't find the answer to in the tutorial and/or docs, or I've read the answer multiple times now, but I'm just not understanding properly...
Scenario:
Let's say I have exactly 1 website that I would like to crawl. Content is rendered dynamically based on query params passed in the URL. I will need to scrape 3 "sets" of data based on the URL param "category".
All the information I need can be grabbed from common base URLs like this:
"http://shop.somesite.com/browse/?product_type=instruments"
And the URls for each category like so:
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums"
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=keyboards"
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=guitars"
The one caveat here is that the site only loads 30 results per initial request. If the user wants to view more, they have to click the "Load More Results..." button at the bottom. After investigating this a bit: during the initial load of the page, only the request for the top 30 is made (which makes sense); after clicking the "Load More..." button, the URL is updated with "pagex=2" appended and the container refreshes with 30 more results. After this, the button goes away, and as the user continues to scroll down the page, subsequent requests are made to the server to get the next 30 results, "pagex" is incremented by one, the container is refreshed with the results appended, rinse and repeat.
I'm not exactly sure how to handle pagination on sites, but the simplest solution I came up with is simply finding out the max "pagex" number for each category, and just setting the start URLs to that number for starters.
For example, if you pass URL in browser:
"http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums&pagex=22"
HTTP Response Code 200 is received and all results are rendered to page. Great! That gives me what I need!
But say next week 50 more items are added, so the max is now "...pagex=24"; I wouldn't get all the latest items.
Or if 50 items are removed and the new max is "...pagex=20", I will get a 404 response when requesting page 22.
I would like to send a test request with the last known "good" max page number and, based on the HTTP response, decide what the URLs will be.
So, before I start any crawling, I would like to add 1 to "pagex" and check for a 404. If I get a 404 I know I'm still good; if I get a 200, I need to keep adding 1 until I get a 404, so I know where the max is (or decrease if needed).
I can't seem to figure out whether I can do this using Scrapy, or whether I have to use a different module to run this check first. I tried adding simple checks for testing purposes in the "parse" and "start_requests" methods, with no luck: start_requests doesn't seem to be able to handle responses, and parse can check the response code but will not update the URL as instructed.
I'm sure it's my poor coding skills (still new to this all), but I can't seem to find a viable solution....
Any thoughts or ideas are very much appreciated!
You can configure in Scrapy which HTTP statuses are handled; that way you can make decisions, for example in the parse method, according to response.status. Check how to handle statuses in the documentation. Example:
class MySpider(CrawlSpider):
    handle_httpstatus_list = [404]
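Building on that, a sketch of the probing idea from the question: walk "pagex" upwards from the last known good value and stop when a 404 comes back. The URL pattern and starting page come from the question; everything else is illustrative:

import scrapy

class ShopSpider(scrapy.Spider):
    name = "shop"
    handle_httpstatus_list = [404]  # let 404 responses reach parse()

    base = "http://shop.somesite.com/browse/?q=&product_type=instruments&category=drums&pagex={}"

    def start_requests(self):
        # start from the last known "good" max page (22 in the question)
        yield scrapy.Request(self.base.format(22), cb_kwargs={"page": 22})

    def parse(self, response, page):
        if response.status == 404:
            self.logger.info("Max page is %d", page - 1)
            return
        # ... extract items from this page here ...
        nxt = page + 1
        yield scrapy.Request(self.base.format(nxt), cb_kwargs={"page": nxt})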
