Responsible time delays - web crawling - python

What is a responsible / ethical time delay to put in a web crawler that only crawls one root page?
I'm using time.sleep(#) between successive calls to
requests.get(url)
I'm looking for a rough idea on what timescales are:
1. Way too conservative
2. Standard
3. Going to cause problems / get you noticed
I want to touch every page (at least 20,000, probably a lot more) meeting certain criteria. Is this feasible within a reasonable timeframe?
EDIT
This question is less about avoiding being blocked (though any relevant info would be appreciated) and more about what time delays avoid causing problems for the host website / servers.
I've tested with 10-second delays on around 50 pages, but I just don't have a clue whether I'm being overcautious.
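For context, a minimal sketch of the kind of loop described above (the list urls and the example URL are placeholders):
import time
import requests

urls = ['https://example.com/page1']   # placeholder: the pages under the root, gathered elsewhere
for url in urls:
    response = requests.get(url)
    # ... process the response here ...
    time.sleep(10)                     # pause between requests; 10 seconds is the delay tested so far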

I'd check their robots.txt. If it lists a Crawl-delay, use it! If not, try something reasonable (this depends on the size of the page). If it's a large page, try 2 requests per second. If it's a simple .txt file, 10 requests per second should be fine.
If all else fails, contact the site owner to see what they're capable of handling nicely.
(I'm assuming this is an amateur server with minimal bandwidth)
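One way to respect a published Crawl-delay is Python's standard urllib.robotparser; a minimal sketch (the example URL and the 5-second fallback are placeholders):
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')       # placeholder: the target site's robots.txt
rp.read()

delay = rp.crawl_delay('*')                        # None if robots.txt sets no Crawl-delay
crawl_delay = delay if delay is not None else 5    # fall back to an arbitrary 5 seconds
print(f'Sleeping {crawl_delay} seconds between requests')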

Related

Locust requests counter and users

I am currently working with the Python performance-testing framework Locust.
I prepared a script that uses Locust as a library and runs multiple tests one after the other.
After every run I multiply the number of users by the test number (e.g. second test -> 2 * users) in order to test how an API would respond to this change.
What I saw for "high" numbers of users was not what I was expecting: the number of requests sent stayed the same even after the increase in users.
For a range between 100 and 1000 users, the request counts shown in the CSV files were practically the same, and I wanted to better understand what could be causing this behaviour.
In this image, extracted from Grafana, every sawtooth wave represents the request count.
It can be seen that the request-count peaks are similar, although there is a difference of 100 users between one run and the next.
Could this be a limitation of Locust as a library?
I tried to explore the documentation on this topic but I did not find anything about this problem.
If someone knows about a reliable source of information it would be very useful.
Thanks to everyone who takes the time to answer my question.
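For reference, a rough sketch of the "Locust as a library" loop described above; the user class, host, endpoint, and timings are placeholders rather than details from the question:
import gevent
from locust import HttpUser, task, constant
from locust.env import Environment

class ApiUser(HttpUser):
    host = "http://localhost:8080"        # placeholder API under test
    wait_time = constant(1)

    @task
    def call_endpoint(self):
        self.client.get("/endpoint")      # placeholder endpoint

base_users = 100
for test_number in range(1, 6):
    users = base_users * test_number      # 100, 200, 300, ...
    env = Environment(user_classes=[ApiUser])
    runner = env.create_local_runner()
    runner.start(users, spawn_rate=users) # ramp everyone up at once
    gevent.spawn_later(60, runner.quit)   # stop each test after 60 seconds
    runner.greenlet.join()
    print(test_number, users, env.stats.total.num_requests)
Note that with a single local runner the client machine's CPU can become a bottleneck, which is one common reason request counts stop scaling with the user count.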

Has Captcha completely invalidated my Selenium script?

I have a web scraper I wrote with Python/Selenium that automatically reserves a spot at my gym for me every morning (you have to reserve at 7 am and they fill up quickly, so I just automated it to run at 7 every day). It had been working well for me for a while, but a couple of days ago it stopped working. So I got up early and checked what was going on, only to find that the gym has added a Captcha to its reservation process.
Does this mean that someone working on the website added a Captcha to it? Or is it Google-added? Regardless, am I screwed? Is there any way for my bot to get around Captcha?
I found that when I run the Selenium script the Captcha requires additional steps (i.e. finding all the crosswalks), whereas when I try to reserve manually the Captcha is still there but only requires me to click on it before moving on. Is this something I can take advantage of?
Thank you in advance for any help.
I've run into similar problems before. Sometimes you're just stuck and can't get past it. That's exactly what Captcha is meant to accomplish, after all.
However, I've found that sometimes the site will only present you with Captcha if it suspects based on your behavior that you are a bot. This can be partially overcome, especially if you're only making occasional calls to a site, by making your bot somewhat less predictable. I do this using np.random. I use a Poisson distribution to simulate user actions within the context of an individual session, since the time between actions is often Poisson distributed. And I randomize the time I log into a site by simply randomly choosing a time within a certain range. These simple actions are highly effective, although eventually most sites will figure out what you're doing.
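As an illustration, a small sketch of the kind of randomisation described, using np.random; the mean pause and the login window are placeholder values:
import time
import numpy as np

rng = np.random.default_rng()

def humanlike_pause(mean_seconds=5):
    # Poisson-distributed wait (in whole seconds) around mean_seconds,
    # so actions within a session are not evenly spaced.
    time.sleep(int(rng.poisson(mean_seconds)))

def random_login_offset(window_seconds=600):
    # Start the session at a uniformly random offset within a window
    # (here up to 10 minutes) rather than at the exact same time every day.
    time.sleep(rng.uniform(0, window_seconds))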
Before you implement either of these solutions, however, I strongly recommend you read the site's Terms of Use and consider whether overcoming their Captcha is a violation. If you signed a use agreement with them the right thing to do is to honor it, even if it's somewhat inconvenient. I'd argue this separate ethical decision is of much greater importance than the technical challenge of trying to bypass their Captcha.
Try using https://github.com/dessant/buster to solve the Captcha.
For an implementation in Python/Selenium, see the linked repository.

How to measure how much download quota used when web page loaded?

I am attempting to quantify how much download quota would be consumed when a certain web page is loaded (in Chrome, in my case), including all of the page's assets (i.e. 'loaded' as in regular human use of the web page).
Is there a way to achieve this using mainstream techniques (e.g. a Python library, Selenium, the netstat command-line utility, curl, or something else)?
Note: I guess one very crude way would be to check my ISP's usage stats before and after the page load, but this is fraught with potential inaccuracies, most notably the device doing background tasks and the ISP not providing usage figures fine-grained enough to discern the additional kilobytes consumed by the page load, so I don't think this method would be reliable.
There may be better ways, but I found one that seems to work:
In Chrome, open Developer Tools (Cmd + Option + J on macOS), click on the 'Network' tab, and refresh the page. When it has fully loaded, look at the summary bar at the bottom, which shows the number of requests and the total amount of data transferred.
Note: to get an accurate reading, it could be important to ensure the 'Disable cache' checkbox is ticked (failing to disable the cache could underestimate the download quota required).
For the page we're on now, we see it uses 1.5 MB without disabling the cache.
Note: the amount seemed to vary quite a bit each time I ran it (not always in a downward direction), so depending on the circumstances it could be worth doing this several times and taking an average.
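If a programmatic measurement is preferred, one possible approach (not from the answer above; it assumes Selenium 4 with Chrome and the DevTools performance log enabled) is to sum the encodedDataLength values from Network.loadingFinished events:
import json
from selenium import webdriver

options = webdriver.ChromeOptions()
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})
driver = webdriver.Chrome(options=options)

driver.get('https://example.com')     # placeholder: the page to measure
# Some assets load after the onload event, so a short wait here may be
# needed before reading the log.

total_bytes = 0
for entry in driver.get_log('performance'):
    message = json.loads(entry['message'])['message']
    if message.get('method') == 'Network.loadingFinished':
        total_bytes += message['params'].get('encodedDataLength', 0)

print(f'Approximate bytes transferred: {total_bytes}')
driver.quit()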

requests.get not working // and a more general one

New to Python (4 months).
I got through the first steps of basic programming skills, I believe (having passed edX MIT 6.00.1x and 6.00.2x),
but I'm having big problems in the world of new libraries...
Here's an example:
r = requests.get('URL', timeout=x)
works well with certain URLs, but keeps waiting with some other URLs, and I get
HTTPSConnectionPool(host='URL', port=443): Read timed out. (read timeout=x)
and without the timeout parameter, the Jupyter notebook just keeps spinning the hourglass.
I am not trying to handle the exception, but to get it to work.
Is there a simple way out, or is requests.get too limited for this kind of task?
And a more general question, if you have the time: learning from the official docs (especially for larger and more complex modules) gets too abstract for me, to the point where it makes me feel hopeless. 'Straight diving' produces problems such as this one, where you can't even figure out the simplest things.
What would be an efficient way to get to grips with state-of-the-art libraries? How did/do you go about it?
Try checking the "robots.txt" file of the website whose content you're trying to scrape (just type something like www.example.com/robots.txt). It's plausible that the website simply does not allow robots to use it. If that's the case, you may try to trick it by passing a custom header:
import requests

headers = {'user-agent': 'Chrome/41.0.2228.0'}     # present yourself as a browser
url = '...'                                        # the URL that times out
r = requests.get(url, headers=headers, timeout=x)  # x = timeout in seconds, as before
But, of course, if you make thousands of queries to a website which does not allow robots, you'll still be blocked after a while.

Speed up feedparser

I'm using feedparser to print the top 5 Google News titles. I get all the information from the URL the same way as always:
import feedparser as fp

x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script every time I start a shell, so the ~2-second lag gets quite annoying. Is this delay primarily from communication over the network, or from parsing the file?
If it's from parsing the file, is there a way to parse only what I need (since that is very minimal in this case)?
If it's from the network, is there any way to speed this process up?
I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably consumes only a little time, but it does take some
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is that you download your feeds on a regular basis (you could set up a cron job or write a Python daemon) and store them somewhere on disk (e.g. in a plain text file), so at your terminal's startup you only need to display them (echo or cat would probably be the easiest and fastest).
I personally have had good experiences with feedparser. I use it to download ~100 feeds every half hour with a Python daemon.
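To make the cached-file suggestion concrete, a rough sketch; the file path and schedule are arbitrary choices:
# refresh_news.py - run periodically, e.g. from cron: */30 * * * * python refresh_news.py
import feedparser

FEED_URL = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
CACHE_FILE = '/tmp/top_news.txt'      # arbitrary location for the cached titles

feed = feedparser.parse(FEED_URL)
with open(CACHE_FILE, 'w') as f:
    for entry in feed.entries[:5]:
        f.write(entry.title + '\n')
The shell startup then only needs something like cat /tmp/top_news.txt, which is effectively instant.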
Parsing in real time is not the best option if you want a fast result.
You can try doing it asynchronously with Celery or a similar solution. I like Celery; it offers many capabilities, such as scheduled (cron-like) tasks as well as asynchronous ones, and more.
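For completeness, a minimal sketch of doing the refresh with Celery beat instead of cron; the broker URL, schedule, and file path are assumptions:
from celery import Celery
from celery.schedules import crontab
import feedparser

app = Celery('news', broker='redis://localhost:6379/0')   # assumed Redis broker

app.conf.beat_schedule = {
    'refresh-feed-every-30-minutes': {
        'task': 'tasks.refresh_feed',
        'schedule': crontab(minute='*/30'),
    },
}

@app.task(name='tasks.refresh_feed')
def refresh_feed():
    # Same idea as the cron version: parse the feed and cache the top titles.
    feed = feedparser.parse('https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss')
    with open('/tmp/top_news.txt', 'w') as f:
        for entry in feed.entries[:5]:
            f.write(entry.title + '\n')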
