New to Python (4 months).
I got through the first steps of basic programming, I believe (having passed the edX MIT 6.00.1x and 6.00.2x courses),
but I'm having big problems in the world of new libraries...
Here's an example:
r = requests.get('URL', timeout=x)
It works well with certain URLs, but keeps waiting with some other URLs, and I am getting
HTTPSConnectionPool(host='URL', port=443): Read timed out. (read timeout=x)
and without the timeout parameter, the Jupyter notebook just keeps spinning the hourglass.
I am not trying to handle the exception but to get it to work.
Is there a simple way out, or does requests.get fall short for these kinds of tasks?
And a more general question, if you have the time: learning from the official docs (especially for larger and more complex modules) gets too abstract for me, to the point where it makes me feel hopeless. 'Diving straight in' produces problems like this one, where you can't figure out even the simplest things.
What would be an efficient way to get to grips with state-of-the-art libraries? How did/do you go about it?
Check the robots.txt file of the website whose content you're trying to scrape (just type something like www.example.com/robots.txt into your browser). It's plausible that the website simply does not allow robots to use it. If that is the case, you can try to work around it by passing a custom header:
import requests

# Pretend to be a regular browser instead of the default python-requests user agent
headers = {'user-agent': 'Chrome/41.0.2228.0'}
url = '...'
r = requests.get(url, headers=headers, timeout=x)
But, of course, if you make thousands of queries to a website which does not allow robots, you'll still be blocked after a while.
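If the underlying problem is a slow or flaky server rather than outright blocking, it can also help to combine an explicit (connect, read) timeout with automatic retries on a Session. This is just a sketch under that assumption; the URL and the retry/timeout numbers are placeholders:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times, with exponential backoff, on common transient errors
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))
session.headers.update({'user-agent': 'Chrome/41.0.2228.0'})

# (connect timeout, read timeout) in seconds
r = session.get('https://example.com', timeout=(5, 30))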
I have a webscraper I wrote with Python/Selenium that automatically reserves a spot at my gym for me every morning (you have to reserve at 7am and they fill up quickly, so I just automated it to run at 7 every day). It's been working well for me for a while, but a couple of days ago it stopped working. So I got up early and checked what was going on, only to find that this gym has added a Captcha to its reservation process.
Does this mean that someone working on the website added a Captcha to it? Or is it Google-added? Regardless, am I screwed? Is there any way for my bot to get around Captcha?
I found that when I run the Selenium script, the Captcha requires additional steps (i.e., finding all the crosswalks), whereas when I try to reserve manually, the Captcha is still there but only requires me to click on it before moving on. Is this something I can take advantage of?
Thank you in advance for any help.
I've run into similar problems before. Sometimes you're just stuck and can't get past it. That's exactly what Captcha is meant to accomplish, after all.
However, I've found that sometimes the site will only present you with Captcha if it suspects based on your behavior that you are a bot. This can be partially overcome, especially if you're only making occasional calls to a site, by making your bot somewhat less predictable. I do this using np.random. I use a Poisson distribution to simulate user actions within the context of an individual session, since the time between actions is often Poisson distributed. And I randomize the time I log into a site by simply randomly choosing a time within a certain range. These simple actions are highly effective, although eventually most sites will figure out what you're doing.
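As a rough sketch of what that can look like (the distribution parameters here are made up, not anything a particular site expects):

import time
import numpy as np

rng = np.random.default_rng()

def humanized_sleep(mean_seconds=8):
    # Poisson-distributed pause between actions, loosely mimicking a human pace
    time.sleep(rng.poisson(mean_seconds))

def random_login_delay(window_minutes=30):
    # Start somewhere at random inside a window instead of exactly on the hour
    return rng.uniform(0, window_minutes * 60)

time.sleep(random_login_delay())   # wait a random amount before logging in
# ... log in, then between each simulated click/action:
humanized_sleep()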
Before you implement either of these solutions, however, I strongly recommend you read the site's Terms of Use and consider whether overcoming their Captcha is a violation. If you signed a use agreement with them the right thing to do is to honor it, even if it's somewhat inconvenient. I'd argue this separate ethical decision is of much greater importance than the technical challenge of trying to bypass their Captcha.
Try using https://github.com/dessant/buster to solve the Captcha.
For an implementation in Python/Selenium -> repository
What is a responsible / ethical time delay to put in a web crawler that only crawls one root page?
I'm using time.sleep(#) between successive calls to
requests.get(url)
I'm looking for a rough idea on what timescales are:
1. Way too conservative
2. Standard
3. Going to cause problems / get you noticed
I want to touch every page (at least 20,000, probably a lot more) meeting certain criteria. Is this feasible within a reasonable timeframe?
EDIT
This question is less about avoiding being blocked (though any relevant info would be appreciated) and more about what time delays do not cause issues for the host website / servers.
I've tested with 10-second delays and around 50 pages. I just don't have a clue whether I'm being overly cautious.
I'd check their robots.txt. If it lists a crawl-delay, use it! If not, try something reasonable (this depends on the size of the page). If it's a large page, try 2 requests per second; if it's a simple .txt file, 10 per second should be fine. (There's a small sketch below for reading the crawl-delay programmatically.)
If all else fails, contact the site owner to see what they're capable of handling nicely.
(I'm assuming this is an amateur server with minimal bandwidth)
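For what it's worth, the standard library can read the crawl-delay for you. A minimal sketch of that idea (the base URL, fallback delay, and page list are placeholders):

import time
import urllib.robotparser
import requests

BASE = 'https://www.example.com'   # placeholder site
FALLBACK_DELAY = 2                 # seconds, used when robots.txt gives no hint

rp = urllib.robotparser.RobotFileParser()
rp.set_url(BASE + '/robots.txt')
rp.read()

delay = rp.crawl_delay('*') or FALLBACK_DELAY

for url in [BASE + '/page1', BASE + '/page2']:   # your 20,000+ URLs would go here
    if rp.can_fetch('*', url):
        r = requests.get(url, timeout=10)
        # ... process r.text ...
    time.sleep(delay)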
I've done a little basic coding in Java, VBA and C++ before, but I'm still a complete newbie. I'm trying to understand what exactly I could do with Python and how it would be useful.
For example, at my appointment-setting job I get customer data in an Excel sheet from my boss, that I'm supposed to put into the CRM system before calling them. It takes me probably at least 5 minutes each time and it's very tedious. Could I automate this process using Python? I've been looking into learning Python recently and thought this might be a good first goal-project in that case, if it's possible and not too difficult to do.
Thanks in advance.
I'm not sure what your CRM is, but pulling Excel content into your Python program is easy. I have used this (https://automatetheboringstuff.com/chapter12/) for processing Excel content in the past.
Yes, you can use Python for this. Automate the Boring Stuff with Python is a good resource for getting started. Once you understand how to do this, you need to find out if your CRM has some sort of API or tool that will allow you to post data with your Python code. Chances are good that it does.
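To give a feel for the shape of such a script, here's a minimal sketch that reads the spreadsheet with openpyxl and pushes each row to a hypothetical CRM REST endpoint. The endpoint URL, column layout, and API key are all made up; a real CRM's API (if it has one) will look different:

import requests
from openpyxl import load_workbook

CRM_ENDPOINT = 'https://crm.example.com/api/customers'   # hypothetical endpoint
API_KEY = 'your-api-key-here'                            # hypothetical credential

wb = load_workbook('customers.xlsx')
sheet = wb.active

# Assumes row 1 is a header row and the columns are: name, phone, email
for name, phone, email in sheet.iter_rows(min_row=2, values_only=True):
    payload = {'name': name, 'phone': phone, 'email': email}
    resp = requests.post(CRM_ENDPOINT, json=payload,
                         headers={'Authorization': f'Bearer {API_KEY}'},
                         timeout=10)
    resp.raise_for_status()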
I've been building a website for some time, and I'm still stuck on one thing:
I store some small videos (roughly 400 MB at most) for my website inside a dbm database, and I'd like to stream them on the site.
I'm building the request handlers by hand using the Tornado Python framework, and I was wondering how to structure my handler. I've never understood how media streaming works and didn't find many discussions of it on the web.
So the complete result I want to achieve is a web player on my website where I can request specific videos and play them without having to load the entire file into memory or send it in one request.
These two links appear to be the answers you are looking for (and guess what? so am I!):
One for Tornado only: it appears to use special annotations.
One for Flask: albeit a motion JPEG example, it shows how you can return a function that performs a "while" loop as a response.
Note that both use the "yield" keyword in Python. It is unclear to me whether the "coroutine" and "asynchronous" decorators are necessary in the Flask example (in other words, it is unclear if the example given in the link is complete... although he literally wrote the book about it, so I suspect it is).
Beware: tests show that tornado.web holds on to the ENTIRE file during download, even if you stream it (i.e. read, write, flush, read...). The reason for this is unclear, and I have yet to find a way around it.
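For reference, the basic read-write-flush pattern in a Tornado handler looks roughly like this. It's only a sketch: get_video_bytes is a made-up helper standing in for the dbm lookup, and note that a dbm lookup returns the whole value anyway, so (as the caveat above suggests) this chunks the response to the client but does not avoid holding the file in memory on the server:

import tornado.web

CHUNK_SIZE = 64 * 1024  # 64 KB per write; tune as needed

class VideoHandler(tornado.web.RequestHandler):
    async def get(self, video_id):
        data = get_video_bytes(video_id)   # hypothetical: load the bytes from the dbm store
        self.set_header('Content-Type', 'video/mp4')
        self.set_header('Content-Length', str(len(data)))
        for start in range(0, len(data), CHUNK_SIZE):
            self.write(data[start:start + CHUNK_SIZE])
            await self.flush()             # push each chunk to the client as it is written

app = tornado.web.Application([(r'/video/(.+)', VideoHandler)])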
I've got a bottle-based HTTP server that mostly shuffles JSON data around. When I run this in Python 2.7 it works perfectly, and in my route handlers I can access the JSON data via bottle.request.json. However, when I run it under Python 3.4 bottle.request.json is None.
I've examined the HTTP traffic, and in both cases it is exactly the same (as would be expected, since that's under the control of the non-Python-dependent client).
I also see that the JSON data is reaching bottle in both cases. If I print out bottle.request.params.keys(), I see the string-ified JSON as the only entry in the list in both cases. And the strings are identical in both cases. For some reason, however, the Python 2 version is recognizing the JSON data while the Python 3 version isn't.
Strangely, this used to work, but some recent change either in my code or bottle (or both) has broken things. Looking over my code, though, I can't see what I might have done to create the problem.
Does anyone know what's going on? Is this something I'm doing wrong at the client end, at the bottle configuration end, or is this a bottle defect? I searched for this problem both on google and the bottle issue tracker, but to no avail.
This turns out to have nothing to do with bottle. The ultimate cause of the problem is that the client request has two Content-Type headers due to a defect in an Emacs Lisp HTTP library. Embarrassingly, I've known about this defect for quite some time, but I thought I'd properly worked around it.
I'm not 100% sure why I see the variance between Python 2 and 3, but my guess right now is that it has to do with otherwise benign changes in the WSGI machinery between the versions.
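If anyone hits the same symptom, one quick way to see what Bottle thinks the Content-Type is, and to fall back to parsing the body yourself, is something along these lines (purely a debugging sketch; the route name is arbitrary):

import json
from bottle import Bottle, request

app = Bottle()

@app.post('/data')
def data():
    # If the Content-Type header isn't a clean 'application/json', request.json stays None
    print(request.content_type)

    payload = request.json
    if payload is None:
        # Fall back to parsing the raw body ourselves
        payload = json.loads(request.body.read().decode('utf-8'))
    return {'received': payload}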