Python requests - Automated queries

I've been trying for the last few days to write a script that logs into an account and grabs some data, but I can't get it to log in; I always encounter this error message:
Your computer or network may be sending automated queries. To protect
our users, we can't process your request right now.
I assume this is the error message shown by reCAPTCHA v2. I'm using a reCAPTCHA-solving service, but I get this error message even on my local machine, with or without a proxy.
I've tried different proxies, different proxy sources, headers and user agents, and nothing seems to work. I've tried requests, Selenium and even my own browser, and I still get this error message.
What kind of workaround is there to prevent this?

I'm writing this answer from my general experience with web scraping.
Different web applications react differently under different conditions, so the solutions given here may not fully solve your problem.
Here are a few workaround methodologies:
Use Selenium only and set a proper window size. Most modern web apps identify users based on window size and user agent; in your case, solutions such as requests, which give you no real window size to control, are not recommended (see the first sketch at the end of this answer).
Use a modern, valid user agent (Mozilla/5.0 compatible). Usually a Chrome UA newer than version 60 will work well.
Keep rotating proxies, changing them after every xxx requests (depending on your workload).
Use a single user agent for a specific proxy. If your UA keeps changing for a specific IP, reCAPTCHA will flag you as automated.
Handle cookies properly. Make sure the cookies set by the server are sent back with subsequent requests (for a single proxy); see the requests.Session sketch at the end of this answer.
Leave a time gap between requests. Use time.sleep() to delay consecutive requests; a delay of about 2 seconds is usually enough.
I know this will considerably slow down your work, but reCAPTCHA is designed precisely to prevent this kind of automated querying/scraping.
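A minimal Selenium sketch that puts the window-size, user-agent, proxy and delay points together; the proxy address, user-agent string and URLs below are illustrative assumptions, not values from the question:

import time
from selenium import webdriver

PROXY = "203.0.113.10:3128"  # assumption: your own proxy, rotated every xxx requests
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36")

options = webdriver.ChromeOptions()
options.add_argument("--window-size=1366,768")   # realistic desktop window size
options.add_argument("--user-agent=" + UA)       # keep this UA fixed for this proxy
options.add_argument("--proxy-server=" + PROXY)  # rotate the proxy, not the UA

driver = webdriver.Chrome(options=options)
for url in ["https://example.com/page1", "https://example.com/page2"]:  # placeholder URLs
    driver.get(url)
    # ... parse the page here ...
    time.sleep(2)  # gap between requests
driver.quit()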
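If you do stay with plain requests for some steps, a Session object keeps the cookies the server sets and sends them back automatically; a small sketch, where the login URL, form field names and proxy are assumptions:

import time
import requests

session = requests.Session()  # cookies set by the server are stored and resent automatically
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."})  # pin one UA
session.proxies.update({"http": "http://203.0.113.10:3128",
                        "https": "http://203.0.113.10:3128"})  # one proxy for the whole session

session.post("https://example.com/login", data={"user": "me", "password": "secret"})  # placeholder form fields
time.sleep(2)  # gap between requests
page = session.get("https://example.com/data")  # the login cookies go along automatically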

Related

If a user is signed into a Google account, can they still face the "I'm not a robot" message?

If you are signed into a Google account, will you still get the "I'm not a robot" message, or is there no way to avoid it besides paid services? I am using the Selenium library in Python.
I'm not a robot reCAPTCHA
This is a challenge test to differentiate between humans and automated bots based on their responses. reCAPTCHA is one of the CAPTCHA spam-protection services, acquired by Google. Automated bots are the biggest headache: they produce spam and consume server resources that are supposed to be used by real users. To keep automated bots out, Google introduced the No CAPTCHA reCAPTCHA API for website owners to protect their sites. Later, to improve the user experience, Google introduced invisible reCAPTCHA.
Invisible reCAPTCHA helps stop bots without showing the I'm not a robot prompt to human users. But it does not work in every situation, and the prompt will still be shown sometimes. For example, the Google search page itself will show the I'm not a robot CAPTCHA under certain circumstances when you enter a query and hit the search button. You will be asked to prove you are a human by ticking the checkbox or selecting images based on the given hint.
Getting interrupted by the I'm not a robot message during a real Google search is genuinely annoying. Sometimes a simple click on the checkbox is enough. Google checks the click position on the checkbox: bots click exactly in the center, while humans click somewhere within the box, and this helps Google decide whether the user is a human or a bot. In the worst case, Google will stop you completely by showing the sorry page; the only option you have then is to wait and try again later.
Root cause of I'm not a robot reCAPTCHA message
Some of the main reasons for this error are as follows:
When Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop.
This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. A different computer using the same IP address may be responsible.
Sometimes you may see this page if you are using advanced terms that robots are known to use, or sending requests very quickly.
Fixing I'm not a robot reCAPTCHA issue
If you are constantly getting interrupted, some remediation steps are as follows:
Stop using VPN.
Avoid unknown proxy servers.
Use Google public DNS.
Stop searching illegal queries.
Slow your clicks.
Stop sending automated queries.
Search like a human.
Check for malware and browser extensions.
Most websites have a barrier to prevent automated test software from signing in or even browsing them, and since Selenium is just an automated web-testing tool, you get blocked from signing in. One way to work around it is to use pyautogui for just the basic sign-in and then carry on with your Selenium code, or to use the API of the particular Google service you need.
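A minimal pyautogui sketch of that hand-off, assuming a real browser window with the login form is already open and focused; the credentials and the Tab/Enter navigation are assumptions about the form:

import time
import pyautogui

time.sleep(3)  # give yourself a moment to focus the browser window with the login form
pyautogui.typewrite("user@example.com", interval=0.1)  # type the username with a human-like delay per key
pyautogui.press("tab")  # assumption: Tab moves focus to the password field
pyautogui.typewrite("secret", interval=0.1)
pyautogui.press("enter")  # submit the form
# from here, continue with your Selenium code (or the service's API) for the rest of the flow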
Even if the user is signed in to a Google account, that does not protect them from a site showing a captcha on any request, because the whole point of the captcha is to determine who is a bot and who is a human. The only way to bypass captchas automatically is to use a recognition service that solves them for you.
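A rough sketch of how such a service is usually wired into Selenium. solve_recaptcha() is a hypothetical placeholder for whatever client your solving service provides, and the element selectors are assumptions about the target page; injecting the returned token into the g-recaptcha-response field is the commonly used pattern:

from selenium import webdriver

def solve_recaptcha(site_key, page_url):
    # hypothetical helper: call your solving service's API here and return the token it sends back
    raise NotImplementedError

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder URL

site_key = driver.find_element_by_css_selector(".g-recaptcha").get_attribute("data-sitekey")
token = solve_recaptcha(site_key, driver.current_url)

# inject the solved token into the hidden response field, then submit the form
driver.execute_script(
    "document.getElementById('g-recaptcha-response').innerHTML = arguments[0];", token)
driver.find_element_by_id("submit").click()  # assumption: the form's submit button id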

Selenium - ERR_TOO_MANY_REDIRECTS [duplicate]

I am trying to automate my login to a webpage to download a daily XML file. I understand that I need to use the actual frame URL, which I think is
http://shop.braintrust.gr/shop/store/customerauthenticateform.asp
I examined the form and its fields, and I do the following:
from selenium import webdriver

browser = webdriver.Chrome('C:\\chromedriver.exe')
browser.get('http://shop.braintrust.gr/shop/store/customerauthenticateform.asp')
print('Browser Opened')
# locate the login fields by their form names and fill in the credentials (email/pwd are defined elsewhere)
username = browser.find_element_by_name('UserID')
username.send_keys(email)
password = browser.find_element_by_name('password')
# time.sleep(2)
password.send_keys(pwd)
but I get a blank page saying that the browser performed too many redirections. Does this mean it is impossible to log in?
How can I login?
thank you
ERR_TOO_MANY_REDIRECTS
ERR_TOO_MANY_REDIRECTS (also known as a redirect loop) is one of the common website errors. Typically it occurs after a recent change to your website, a misconfiguration of redirects on your server, or wrong settings with third-party services.
This error has no relation to Selenium as such and can be reproduced through manual steps.
The reason for ERR_TOO_MANY_REDIRECTS is that something is causing your website to go into an infinite redirection loop. Essentially the site is stuck (for example, URL 1 points to URL 2 and URL 2 points back to URL 1, or the domain has redirected you too many times), and unlike some other errors, these loops rarely resolve themselves and will probably need you to take action to fix them. There are a couple of different variations of this error depending on the browser you're running.
Solution
Some common approaches to check and fix the error are as follows:
Delete Cookies on That Specific Site: Google and Mozilla both recommend, right below the error, trying to clear your cookies. Cookies can sometimes contain faulty data which could cause the ERR_TOO_MANY_REDIRECTS error. This is one recommendation you can try even if you're encountering the error on a site you don't own. Because cookies retain your logged-in status and other settings on sites, it's enough to delete the cookie(s) on the site that is having the problem. This way you won't impact any of your other sessions or websites that you frequently visit.
Clear Browser Cache: If you want to check and see if it might be your browser cache, without clearing your cache, you can always open up your browser in incognito mode. Or test another browser and see if you still see the ERR_TOO_MANY_REDIRECTS error.
Determine the Nature of the Redirect Loop: If clearing the cache didn't work, you'll want to see if you can determine the nature of the redirect loop, for example a site with a 301 redirect back to itself, which causes a long chain of faulty redirects. You can follow all the redirects and determine whether it is looping back to itself or is perhaps an HTTP-to-HTTPS loop (see the sketch after this list).
Check Your HTTPS Settings: Another thing to check is your HTTPS settings. A lot of times it is observed ERR_TOO_MANY_REDIRECTS occur when someone has just migrated their WordPress site to HTTPS and either didn’t finish or setup something incorrectly.
Check Third-Party Services: ERR_TOO_MANY_REDIRECTS is also often commonly caused by reverse-proxy services such as Cloudflare. This usually happens when their Flexible SSL option is enabled and you already have an SSL certificate installed with your WordPress host. Why? Because, when flexible is selected, all requests to your hosting server are sent over HTTP. Your host server most likely already has a redirect in place from HTTP to HTTPS, and therefore a redirect loop occurs.
Check Redirects on Your Server: Besides HTTP to HTTPS redirects on your server, it can be good to check and make sure there aren’t any additional redirects setup wrong. For example, one bad 301 redirect back to itself could take down your site. Usually, these are found in your server’s config files.
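To determine the nature of a redirect loop (third item above), you can trace the chain without a browser; a minimal sketch using the requests library, with the URL being the one from the question and otherwise a placeholder:

import requests
from urllib.parse import urljoin

url = "http://shop.braintrust.gr/shop/store/customerauthenticateform.asp"  # the page you are debugging
for _ in range(15):  # stop after a handful of hops so a loop becomes obvious
    resp = requests.get(url, allow_redirects=False)  # don't follow redirects automatically
    print(resp.status_code, url)
    if resp.status_code not in (301, 302, 303, 307, 308):
        break
    url = urljoin(url, resp.headers["Location"])  # resolve the next hop (may be relative)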

Is it possible to let the client interact with recaptcha on the server side?

I'm using an rpyc server to fetch data with Selenium when a connection to a client is established. The problem is that the URL I'm trying to access occasionally prompts a reCAPTCHA that has to be filled in before the data can be accessed.
I don't really need a way to automate its completion; what I do want is a way to stream the browser from the server to the client when it encounters a reCAPTCHA, so that the user can interact with the browser and fill in the reCAPTCHA manually, and from there let the server carry on with the rest of its code.
Something similar to TeamViewer's functionality, implemented in my setup.
I couldn't find any direction to follow on this subject yet, and couldn't figure out a method to try myself.
If you work with Selenium, you can programmatically wait for elements or detect them. You could have your program wait for the reCAPTCHA element and print a message in the console telling the user to solve it, while in the background the program already waits for the elements that appear once the reCAPTCHA is solved. As soon as the user has solved it, the program resumes automatically.
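A minimal sketch of that wait-for-the-human pattern; the selectors for the captcha frame and for the element that only appears after solving are assumptions about the target page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/data")  # placeholder: the page that sometimes shows a captcha

captchas = driver.find_elements(By.CSS_SELECTOR, "iframe[src*='recaptcha']")
if captchas:
    print("reCAPTCHA detected - please solve it in the browser window.")
    # block until an element that only exists after the captcha is solved shows up
    WebDriverWait(driver, 300).until(
        EC.presence_of_element_located((By.ID, "data-table")))  # assumption: post-captcha element
# ...continue with the rest of the server-side scraping code...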

web scraping with selenium phantom js or python requests - every 2-3 pages server returns 'bad' page

I've been scraping happily with Selenium/PhantomJS. Recently I noticed that one of the websites I am scraping started returning a 'bad' page (a page with no relevant content) every 2-3 pages, and it is not clear why. I tested with Python requests and I get similar issues, although it is slightly better (more like 3-4 pages before I get a bad one).
What I do:
I have a page url list that I shuffle - so it is unlikely to have the same scraping pattern.
I have a random 10-20 seconds sleep between requests (none of it is urgent)
I tried with and without cookies
I tried rotating IP addresses (bounce my server between scrapes and get new IP address)
I checked robots.txt - not doing anything 'bad'
User agent is set in a similar manner to what I get on my laptop (http://whatsmyuseragent.com/)
phantomjs desired capabilities set to a dictionary identical to DesiredCapabilities.CHROME (I actually created my own Chrome dictionary and embedded the real chrome version I am using).
JavaScript enabled (although I do not really need it)
I set ignore ssl errors using service_args=['--ignore-ssl-errors=true']
I only run the scrape twice a day ~9 hours apart. Issues are the same whether I run the code on my laptop or on Ubuntu somewhere in the cloud.
Thoughts?
If the server is throttling or blocking you, you need to contact the server's admin and ask them to whitelist you.
There is nothing else you can do except try to scrape even more slowly.
If the server is overloaded, you could try different times of the day. If the server is buggy, try to reproduce the problem and inform the admin.
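If contacting the admin is not an option, scraping more slowly with a randomized, increasing delay is about all that's left; a small sketch where page_urls and the is_bad_page() check are placeholders for your own list and detection logic:

import random
import time
import requests

def is_bad_page(html):
    # placeholder: your own check for the 'bad' page with no relevant content
    return "expected content marker" not in html

delay = 20  # start around the 10-20 second gap already in use
for url in page_urls:  # assumption: your shuffled list of page URLs
    resp = requests.get(url)
    if is_bad_page(resp.text):
        delay = min(delay * 2, 300)   # back off harder after a bad page
    else:
        delay = max(delay * 0.8, 20)  # slowly relax again after good pages
    time.sleep(delay + random.uniform(0, 10))  # add jitter so the timing pattern isn't fixed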

Capturing browser specific rendering of a webpage?

Is there any way to capture (as an image, PDF, etc.) how a webpage will look in, let's say, Chrome or IE? I am guessing there will be different ways to do this for different browsers, but is there any API, library or add-on that does this?
Use Selenium WebDriver (it has a Python API) to remote-control a browser and take a screenshot. It supports all major browsers as far as I'm aware.
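A minimal sketch with the Python bindings; swap webdriver.Chrome() for webdriver.Firefox(), webdriver.Ie(), etc. to capture other browsers' rendering (the URL and filename are placeholders):

from selenium import webdriver

driver = webdriver.Chrome()  # or webdriver.Firefox(), webdriver.Ie(), ...
driver.get("https://example.com")  # placeholder URL
driver.save_screenshot("example_chrome.png")  # PNG of the page as this browser renders it
driver.quit()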
Yes, there are a few wonderful websites providing this service, as well as some primitive-to-advanced API services for capturing browser screenshots.
Browsershots.org
It's quite slow most of the time, perhaps due to the heavy traffic it has to withstand. However, it is one of the best screenshot providers.
Check http://browsershots.org/xmlrpc/ to understand how to use the XMLRPC-based API for Browsershots.
And if you want a primitive and straightforward thumbnailing service, the following sites may work well for you.
http://www.thumbalizr.com/
http://api1.thumbalizr.com/?url=http://acpmasquerade.com&width=some_width
I checked another website, webshotspro.com, and when I queued one for a snapshot, it said my queue was behind 7053 other requests. The loading icon keeps rotating :P
Give the XMLRPC call from Browsershots.org a try.
