I had been using urllib2 to parse data from HTML webpages. It worked perfectly for some time, then permanently stopped working for one particular website.
Not only did the script stop working, but I was no longer able to access the website at all, from any browser. In fact, the only way I could reach the website was through a proxy, which leads me to believe that requests from my computer were blocked.
Is this possible? Has this happened to anyone else? If that is the case, is there any way to get unblocked?
It is indeed possible; maybe the sysadmin noticed that your IP was making way too many requests and decided to block it.
It could also be that the server has a request limit that you exceeded.
If you don't have a static IP, a restart of your router should reset your IP, making the ban useless.
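If you do get access back and still need the data, one low-effort precaution is to space the requests out. Below is a minimal sketch using urllib.request (urllib2's Python 3 counterpart); the URL list and the delay are placeholders, not values from the question:

    import time
    import urllib.request

    # Placeholder URLs -- substitute the pages you actually need.
    urls = ["http://example.com/page1", "http://example.com/page2"]

    for url in urls:
        with urllib.request.urlopen(url) as resp:
            html = resp.read()
        # ... parse html here ...
        time.sleep(5)  # pause between fetches to stay well under any rate limit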
For the last few days I've been trying to write a script that logs into an account and grabs data, but I can't manage to get it to log in and I always encounter this error message:
Your computer or network may be sending automated queries. To protect
our users, we can't process your request right now.
I assume this is the error message provided by ReCaptcha v2. I'm using a ReCaptcha service, but I get this error message even on my machine locally, with or without a proxy.
I've tried different proxies, different proxy sources, headers, and user agents; nothing seems to work. I've used requests and still get this error message, Selenium and still get this error message, and even my own browser and still get this error message.
What kind of workaround is there to prevent this?
So I am writing this answer from my general experience with web scraping.
Different web applications react differently under different conditions, so the solutions I am giving here may not fully solve your problem.
Here are a few workaround methodologies (a combined sketch follows the list):
Use Selenium only and set a proper window size. Most modern web apps identify users based on window size and user agent. In your case I would not recommend other solutions such as requests, which do not allow proper handling of window size.
Use a modern, valid user agent (Mozilla 5.0 compatible). Usually a Chrome browser > 60.0 UA will work well.
Keep chaining and changing proxies every xxx requests (depends on your workload).
Use a single user agent for a specific proxy. If your UA keeps changing for a specific IP, Recaptcha will flag you as automated.
Handle cookies properly. Make sure the cookies set by the server are sent with subsequent requests (for a single proxy).
Use a time gap between requests. Use time.sleep() to delay consecutive requests; usually a delay of 2 seconds is enough.
I know this will considerably slow down your work, but Recaptcha is designed precisely to prevent such automated queries/scraping.
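Here is a minimal sketch combining several of the points above: a fixed window size, an explicit modern user agent, a single proxy, and a delay between page loads. The proxy address and the URLs are placeholders, not values from your setup:

    import time
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--window-size=1366,768")
    options.add_argument(
        "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
    )
    options.add_argument("--proxy-server=http://203.0.113.10:8080")  # placeholder proxy

    driver = webdriver.Chrome(options=options)
    try:
        for url in ["https://example.com/page1", "https://example.com/page2"]:
            driver.get(url)
            # ... log in / extract data here ...
            time.sleep(2)  # time gap between consecutive requests
    finally:
        driver.quit()

Keep the same user agent paired with the same proxy for as long as you use it, and rotate both together.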
I'm writing this application where the user can perform a web search to obtain some information from a particular website.
Everything works well except when I'm connected to the Internet via Proxy (it's a corporate proxy).
The thing is, it works sometimes.
By sometimes I mean that if it stops working, all I have to do is use any web browser (Chrome, IE, etc.) to surf the internet, and then Python's requests starts working as before.
The error I get is:
OSError('Tunnel connection failed: 407 Proxy Authentication Required',)
My guess is that some sort of credentials are validated and the proxy tunnel is up again.
I tried with the proxy handlers but it remains the same.
My questions are:
How do I know if the proxy needs authentication, and if so, how do I handle it without hard-coding the username and password, since this application will be used by others?
Is there a way to use the Windows default proxy configuration so it will work like the browsers do?
What do you think happens when I surf the internet and then the Python requests start working again?
I tried with requests and urllib.request
Any help is appreciated.
Thank you!
Check whether there is any proxy setting configured in Chrome.
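Regarding using the Windows default proxy configuration, here is a minimal sketch of one way to pick up the system-level proxy settings from Python instead of hard-coding them; the target URL is a placeholder, and whether the corporate proxy accepts this without extra authentication is an assumption:

    import urllib.request
    import requests

    # Read the OS-level proxy settings (on Windows this comes from the
    # Internet Options / registry, i.e. the configuration the browsers use).
    system_proxies = urllib.request.getproxies()
    print(system_proxies)  # e.g. {'http': 'http://proxy.corp.local:8080', ...}

    # Hand the detected proxies to requests explicitly.
    resp = requests.get("https://example.com", proxies=system_proxies, timeout=10)
    print(resp.status_code)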
I am new here so bear with me if I break the etiquette for this forum. Anyway, I've been working on a python project for a while now and I'm nearing the end but I've been dealing with the same problem for a couple of days now and I can't figure out what the issue is.
I'm using Python and the requests module to send a POST request to the checkout page of an online store. The response I get when I send it in is the page where you put in your information, not the page that says your order was confirmed, etc.
At first I thought that it could be the form data that I was sending in, and I was right. I checked what it was supposed to be in the network tab in Chrome and saw I was sending in 'Visa' when it was supposed to be 'visa', but it still didn't work after that. Then I thought it could be the encoding, but I have no clue how to check what kind the site takes.
Do any of you have any ideas of what could be preventing this from working? Thanks.
EDIT: I realized that I wasn't sending a Cookie in the request headers, so I fixed that and it's still not working. I set up a server script that prints the request on another computer and posted to that instead and the requests are exactly the same, both headers and body. I have no clue what it could possibly be.
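For reference, here is a stripped-down sketch of the flow being attempted; the URL and the form field names are placeholders, not the real ones from the store:

    import requests

    with requests.Session() as s:
        # Load the checkout page first so any cookies the server sets are
        # carried into the POST automatically.
        s.get("https://example.com/checkout")

        form_data = {
            "card_type": "visa",   # placeholder field names and values
            "card_number": "4111111111111111",
        }
        resp = s.post("https://example.com/checkout", data=form_data)
        print(resp.status_code, resp.url)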
I've been scraping happily with Selenium/PhantomJS. Recently, I noticed that one of the websites I am scraping started returning a 'bad' page (a page with no relevant content) every 2-3 pages - not clear why. I tested with Python requests and I am getting similar issues, although it is slightly better (more like 3-4 pages before I get a bad one).
What I do:
I have a page URL list that I shuffle, so it is unlikely to produce the same scraping pattern.
I have a random 10-20 second sleep between requests (none of it is urgent).
I tried with and without cookies
I tried rotating IP addresses (bounce my server between scrapes and get new IP address)
I checked robots.txt - not doing anything 'bad'
User agent is set in a similar manner to what I get on my laptop (http://whatsmyuseragent.com/)
PhantomJS desired capabilities are set to a dictionary identical to DesiredCapabilities.CHROME (I actually created my own Chrome dictionary and embedded the real Chrome version I am using).
JavaScript enabled (although I do not really need it)
I set ignore ssl errors using service_args=['--ignore-ssl-errors=true']
I only run the scrape twice a day, ~9 hours apart. Issues are the same whether I run the code on my laptop or on Ubuntu somewhere in the cloud. A condensed sketch of the requests variant of my loop is below.
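The sketch condenses the requests variant described above; the URLs and the user-agent string are placeholders, not my real values:

    import random
    import time
    import requests

    # Placeholder URLs -- the real list is much longer.
    urls = ["https://example.com/item/1", "https://example.com/item/2"]
    random.shuffle(urls)  # avoid a fixed scraping pattern

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36"
        )
    }

    for url in urls:
        resp = requests.get(url, headers=headers, timeout=30)
        # ... check whether the page is 'bad' and parse it here ...
        time.sleep(random.uniform(10, 20))  # random 10-20 second gap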
Thoughts?
If the server is throttling or blocking you, you need to contact the admin of the server and ask him to whitelist you.
There is nothing you can do except trying to scrape even slower.
If the server is overloaded you could try different times of the day. If the server is bugged, try to reproduce it and inform the admin.
I'm using Python 3.3 and the requests module to scrape links from an arbitrary webpage. My program works as follows: I have a list of URLs which in the beginning contains just the starting URL.
The program loops over that list and gives the URLs to a procedure GetLinks, where I'm using requests.get and BeautifulSoup to extract all links. Before that procedure appends links to my URL list, it gives them to another procedure testLinks to see whether each is an internal, external or broken link. In testLinks I'm using requests.get too, in order to be able to handle redirects etc.
The program has worked really well so far; I tested it on quite a few websites and was able to get all the links from sites with around 2000 pages. But yesterday I encountered a problem on one page, which I noticed by looking at the Kaspersky Network Monitor. On this page some TCP connections just don't reset; it seems to me that in that case the initial request for my first URL doesn't get reset, and the connection stays open for as long as my program runs.
OK so far. My first try was to use requests.head instead of .get in my testLinks procedure, and then everything works fine! The connections are released as wanted. But the problem is that the information I get from requests.head is not sufficient: I'm not able to see the redirected URL or how many redirects took place.
Then I tried requests.head with
allow_redirects=True
But unfortunately this is then not a real .head request any more; it behaves like a usual .get request, so I got the same problem. I also tried to set the parameter
keep_alive=False
but it didn't work either. I even tried to use urllib.request(url).geturl() in my testLinks for the redirect issues, but here the same problem occurs: the TCP connections don't get reset.
I tried a lot to avoid this problem; I used request sessions but they also had the same problem. I also tried a requests.post with the header information Connection: close, but it didn't work.
I analyzed some links where I think it gets stuck, and so far I believe it has something to do with redirects like 301->302. But I'm really not sure, because on all the other websites I tested there must have been such redirects too; they are quite common.
I hope someone can help me. For information: I'm using a VPN connection to be able to see all websites, because the country I'm in right now blocks some pages which are interesting to me. But of course I tested it without the VPN and had the same problem.
Maybe there's a workaround, because requests.head in testLinks would be sufficient if, in the case of redirects, I could just see the final URL and maybe the number of redirects.
If the text is not easily readable, I will provide a schematic of my code.
Thanks a lot!
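For reference, a minimal sketch of the kind of testLinks helper described above, assuming requests.head with allow_redirects=True is usable: the final URL and the number of redirects are read from the response, and the response is closed explicitly in the hope of releasing the connection. The URL is a placeholder:

    import requests

    def test_link(url):
        resp = requests.head(url, allow_redirects=True, timeout=10)
        try:
            final_url = resp.url                # URL after following all redirects
            num_redirects = len(resp.history)   # one entry per redirect hop
            return final_url, num_redirects
        finally:
            resp.close()  # explicitly release the underlying connection

    print(test_link("http://example.com"))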