Selenium - ERR_TOO_MANY_REDIRECTS [duplicate] - python

I am trying to automate my login to a webpage in order to download a daily XML file. I understand that I need the actual frame URL, which I think is
http://shop.braintrust.gr/shop/store/customerauthenticateform.asp
I examined the form and its fields, and I do the following:
from selenium import webdriver

browser = webdriver.Chrome('C:\\chromedriver.exe')
browser.get('http://shop.braintrust.gr/shop/store/customerauthenticateform.asp')
print('Browser Opened')
username = browser.find_element_by_name('UserID')
username.send_keys(email)
password = browser.find_element_by_name('password')
# time.sleep(2)
password.send_keys(pwd)
but I get a blank page saying that the browser performed too many redirections. Does this mean that it is impossible to log in?
How can I log in?
Thank you.

ERR_TOO_MANY_REDIRECTS
ERR_TOO_MANY_REDIRECTS (also known as a redirect loop) is one of the common website errors. Typically this error occurs after a recent change to your website, a misconfiguration of redirects on your server, or wrong settings with third-party services.
This error has no relation to Selenium as such and can be reproduced through manual steps.
The reason for ERR_TOO_MANY_REDIRECTS is that something is causing your website to go into an infinite redirection loop. Essentially the site is stuck (for example, URL 1 points to URL 2 and URL 2 points back to URL 1, or the domain has redirected you too many times), and unlike some other errors, these rarely resolve themselves and will probably need you to take action to fix them. There are a couple of different variations of this error depending on the browser you're running.
Solution
Some common approaches to check and fix the error are as follows:
Delete Cookies on That Specific Site: Google and Mozilla both in fact recommend, right below the error, trying to clear your cookies. Cookies can sometimes contain faulty data which can cause the ERR_TOO_MANY_REDIRECTS error. This is one recommendation you can try even if you're encountering the error on a site you don't own. Because cookies retain your logged-in status on sites and other settings, in these cases you can simply delete the cookie(s) on the site that is having the problem. This way you won't impact any of your other sessions or websites that you frequently visit.
Clear Browser Cache: If you want to check whether it might be your browser cache without actually clearing it, you can always open your browser in incognito mode. Or test another browser and see if you still get the ERR_TOO_MANY_REDIRECTS error.
Determine the Nature of the Redirect Loop: If clearing the cache didn't work, you'll want to see if you can determine the nature of the redirect loop. For example, a site may have a 301 redirect loop back to itself, causing a large chain of faulty redirects. You can follow all the redirects and determine whether it is looping back to itself, or perhaps is an HTTP-to-HTTPS loop (see the sketch after this list).
Check Your HTTPS Settings: Another thing to check is your HTTPS settings. ERR_TOO_MANY_REDIRECTS is often observed when someone has just migrated their WordPress site to HTTPS and either didn't finish the migration or set something up incorrectly.
Check Third-Party Services: ERR_TOO_MANY_REDIRECTS is also commonly caused by reverse-proxy services such as Cloudflare. This usually happens when their Flexible SSL option is enabled and you already have an SSL certificate installed with your WordPress host. Why? Because when Flexible SSL is selected, all requests to your hosting server are sent over HTTP. Your host server most likely already has a redirect in place from HTTP to HTTPS, and therefore a redirect loop occurs.
Check Redirects on Your Server: Besides HTTP-to-HTTPS redirects on your server, it is worth checking that there aren't any additional redirects set up incorrectly. For example, one bad 301 redirect back to itself could take down your site. Usually these are found in your server's config files.
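To inspect the redirect chain from a script, a minimal sketch using the requests library (assuming it is installed; the URL here is just the page from the question) could look like this:

import requests

url = 'http://shop.braintrust.gr/shop/store/customerauthenticateform.asp'
try:
    response = requests.get(url, allow_redirects=True, timeout=10)
    # Each hop of the redirect chain is recorded in response.history
    for hop in response.history:
        print(hop.status_code, hop.url)
    print('Final:', response.status_code, response.url)
except requests.exceptions.TooManyRedirects:
    # requests gives up after 30 hops, which signals a redirect loop
    print('Redirect loop detected')

If the printed URLs cycle between the same one or two addresses, you are looking at the loop described above.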

Related

How to modify the redirect url in Microsoft login in Flask with nginx, ubuntu

My Flask web application requires a Microsoft login feature.
The Microsoft login requires a redirect URL that sends the user back to Flask after successfully logging in to Microsoft.
My HTTPS structure is: nginx listens on 443 and proxies all requests to http://127.0.0.1:5000, where my Flask app is running.
(This is the most popular method I found for running Flask in production mode with HTTPS.)
Now comes the problem: the redirect URL sent to the Microsoft login is http://127.0.0.1:5000
But all other redirects, e.g. (ignore my function names, you know what I mean):
resp = resp.set(url_for('index'),200)
return resp
or
return redirect('/whatever_page')
are all redirected to https://example.com/{whatever_page}, which is completely fine.
But when it comes to the redirect URL used in the Microsoft login, it fails.
The Microsoft login code I am using is basically the Flask demo I downloaded from Microsoft; all I did was turn the whole thing into another function of my Flask app and call it when I need it. I went through the code step by step and did everything I could to make things right, yet nothing worked.
I have tried changing the IP in proxy_pass to the public IP and some other IPs; they didn't work, but I can see that the IP used in the redirect URL changes with the proxy_pass IP.
I have tried many configurations that might make things right, for example:
proxy_redirect http://127.0.0.1:5000/getAToken http://example.com/getAToken;
or
proxy_redirect http://127.0.0.1:5000/ https://example.com/;
or
proxy_redirect http://127.0.0.1:5000/ http://example.com/;
or
proxy_redirect http://127.0.0.1:5000/getAToken https://example.com/getAToken;
None of the proxy_redirect configurations affect the redirect URL used in the Microsoft login, but all other redirects work.
I even tried to change the redirect URL manually by modifying the login URL Microsoft generated; it didn't work, since the redirect_url is used to encrypt the token or whatever it is.
My current hacky options are:
Going back to the development server with HTTPS, but then I need to restart the service every hour or so to avoid the "not responding" problem in the development server. Costs: loss of cache and potential crashes when more users come. (This is what I am currently using to keep the website up.)
Use a Windows server. Development mode on a Windows server doesn't stop responding; it may at some point, but I honestly don't know. I have one that has been running for months and has never stopped responding. The problem here is that I don't know when it will stop responding.
Buy another server and use it as a login server; it can be restarted every hour, since cache or whatever doesn't matter there.
But I really want to solve this problem without using any of the options above, and modifying the redirect URL before the Microsoft login URL is generated is the best option I can think of; I just don't know how.
If there are other options besides these, please let me know; I would really appreciate it.
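For what it's worth, one common pattern for this situation (a sketch only, since the Microsoft demo code isn't shown here) is to make Flask aware that it sits behind a TLS-terminating proxy, so that any URL built with url_for(..., _external=True) uses the public scheme and host instead of 127.0.0.1:5000. A minimal sketch with Werkzeug's ProxyFix middleware, assuming nginx forwards the X-Forwarded-* headers shown in the comments:

# nginx side, inside the location block that proxies to Flask:
#   proxy_set_header X-Forwarded-Proto $scheme;
#   proxy_set_header X-Forwarded-Host  $host;
from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix

app = Flask(__name__)
# Trust one hop of X-Forwarded-Proto/X-Forwarded-Host set by nginx
app.wsgi_app = ProxyFix(app.wsgi_app, x_proto=1, x_host=1)

With something like this in place, a redirect URL generated inside the login helper should come out as https://example.com/getAToken rather than http://127.0.0.1:5000/getAToken, assuming the demo builds it with url_for.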

Python requests - Automated queries

I've been trying for the last few days to write a script that logs into an account and grabs data, but I can't manage to get it to log in, and I always encounter this error message:
Your computer or network may be sending automated queries. To protect
our users, we can't process your request right now.
I assume this is the error message provided by ReCaptcha v2. I'm using a ReCaptcha service, but I even get this error message on my machine locally, with or without a proxy.
I've tried different proxies, different proxy sources, headers, and user agents; nothing seems to work. I've used requests and still get this error message, Selenium and still get this error message, and even my own browser and still get this error message.
What kind of workaround is there to prevent this?
I am writing this answer from my general experience with web scraping. Different web applications react differently under different conditions, so the solutions I give here may not fully solve your problem.
Here are a few workaround methodologies:
Use Selenium only, and set a proper window size. Most modern web apps identify users based on window size and user agent. In your case it is not recommended to go for other solutions such as requests, which do not allow proper handling of window size (see the sketch at the end of this answer).
Use a modern, valid user agent (Mozilla/5.0 compatible). Usually a Chrome browser UA above version 60.0 will work well.
Keep chaining and changing proxies after every xxx requests (depending on your workload).
Use a single user agent for a specific proxy. If your UA keeps changing for a specific IP, ReCaptcha will flag you as automated.
Handle cookies properly. Make sure the cookies set by the server are sent with subsequent requests (for a single proxy).
Use a time gap between requests. Use time.sleep() to delay consecutive requests. Usually a delay of 2 seconds is enough.
I know this will considerably slow down your work, but ReCaptcha is designed precisely to prevent such automated queries/scraping.
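A minimal sketch of the first two points, assuming Chrome and chromedriver are installed; the user-agent string below is just an example of a modern Chrome UA:

from selenium import webdriver

options = webdriver.ChromeOptions()
# One fixed, modern user agent (example string, point 2 above)
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/90.0.4430.93 Safari/537.36')
browser = webdriver.Chrome(options=options)
# A realistic, fixed desktop window size (point 1 above)
browser.set_window_size(1366, 768)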

Using python to parse a webpage that is already open

From this question, the last responder seems to think that it is possible to use Python to open a webpage, let me sign in manually, go through a bunch of menus, and then let Python parse the page once I get where I want. The website has a weird sign-in procedure, so using requests and passing a username and password will not be sufficient.
However, it seems from this question that it's not a possibility.
So the question is: is it possible? If so, do you know of some example code out there?
The way to approach this problem is to log in normally with the developer tools open next to you and watch what the request is sending.
When logging in to bandcamp, for example, you can watch the XHR request that is sent during login.
From the response you can see that an identity cookie is being set. That's probably how they identify that you are logged in. So once you have that cookie set, you are authorized to view logged-in pages.
So in your program you could log in normally using requests, save the cookie in a variable, and then apply the cookie to further requests using requests.
Of course, login procedures and how the authorization mechanism works may differ, but that's the general gist of it.
So when do you actually need Selenium? You need it if a lot of the page is rendered by JavaScript. requests is only able to get the HTML, so if the menus and such are rendered with JavaScript, you won't ever be able to see that information using requests.
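A minimal sketch of the cookie-reuse idea, assuming a plain form login; the URL and field names are placeholders for whatever the developer tools actually show:

import requests

# A Session persists cookies across requests automatically
session = requests.Session()
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
# The identity cookie set by the server now lives on the session,
# so this request is made as the logged-in user
page = session.get('https://example.com/private-page')
print(page.text)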

Disable SSL for Heroku App (django)

We've decided not to use SSL anymore, and unfortunately our server guy has quit, so now I need to fix this. I've revoked the certs from Comodo and removed the SSL app from Heroku, but that was apparently not enough, and now we have serious problems with our site.
When visiting inteokej.nu one gets redirected to the app, but HTTP automatically turns into HTTPS, and instead of showing the domain (inteokej.nu) the app link https://inteokej.herokuapp.com is shown (I want inteokej.nu to be shown, not the actual app link).
That is a problem, but not the biggest problem, which is that it's no longer possible to use the site (e.g. to log in; the static pages work, though). When I try to log in I first get an HTTPS security error, and when I proceed I get the following page: https://www.inteokej.nu/cgi-sys/defaultwebpage.cgi ("Sorry! If you are the owner of this website, please contact your hosting provider: webmaster#inteokej.nu").
I've now learned the hard way that SSL is a complex thing, but I really need to get this site up again as soon as possible. So where should I start, and how can I proceed from this point? I guess there's some back-end coding that should be done in the Django code as well?
Thanks a lot in advance!
Your issue doesn't seem to be with SSL but with DNS, or at least with how your server guy set things up.
The error page you're seeing isn't a Heroku error; inteokej.nu isn't being hosted on Heroku but on a server run by your DNS provider, svenskadomaner.se.
If you use the Firefox Live HTTP Headers plugin you can follow the request/response cycle, and you'll see that there is a 301 redirect from www.inteokej.nu to inteokej.herokuapp.com (probably an .htaccess redirect).
Check the DNS records for your domain (for example at http://viewdns.info/dnsrecord/?domain=inteokej.nu) and you'll see that there is no CNAME record to Heroku, only an A record pointing to 46.22.116.5, an IP address owned by svenskadomaner.se.
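If you'd rather check from a script than a website, a quick sketch using only the standard library (this resolves the final A record; following CNAME chains needs a DNS library such as dnspython):

import socket

# Prints the IP address the domain currently resolves to
print(socket.gethostbyname('inteokej.nu'))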
So the thing to do is to set up the custom domain as recommended on Heroku's site:
https://devcenter.heroku.com/articles/custom-domains
and point the CNAME at Heroku's recommended target.
One reason your server guy might have set things up like that is that Heroku doesn't easily allow "naked domains", so people often use .htaccess redirects from example.com to www.example.com (which does work easily with CNAMEs).
Good luck!

Google App Engine URL Fetch Doesn't Work on Production

I am using Google App Engine's urlfetch feature to remotely log into another web service. Everything works fine in development, but when I move to production the login procedure fails. Do you have any suggestions on how to debug a production URL fetch?
I am using cookies and other headers in my URL fetch (I manually set up the cookies within the header). One of the cookies is a session cookie.
There is no error or exception. In production, posting a login to the URL returns the session cookies, but when you request a page using those session cookies, they are ignored and you are prompted for login information again. In development, once you get the session cookies you can access the internal pages just fine. I thought the problem was related to saving the cookies, but they look correct, as the requests are nearly identical.
This is how I call it:
fetchresp = urlfetch.fetch(url=req.get_full_url(),
                           payload=req.get_data(),
                           method=method,
                           headers=all_headers,
                           allow_truncated=False,
                           follow_redirects=False,
                           deadline=10)
Here are some guesses as to the problem:
The distributed nature of Google's urlfetch implementation is messing things up.
In production, headers are sent in a different order than in development, perhaps confusing the server.
Some of Google's servers are blacklisted by the target server.
Here are some hypotheses that I've ruled out:
Google's caching is too aggressive. But I still get the problem after turning off caching with the header Cache-Control: no-store.
Google's urlfetch is too fast for the target server. But I still get the problem after inserting delays between calls.
Google appends some data to the User-Agent header. But I have added that header in development and I don't get the problem.
What other differences are there between the production URL fetch and the development URL fetch? Do you have any ideas for debugging this?
UPDATE 2
(First update was incorporated above)
I don't know if it was something I did (maybe adding the delays or disabling the caches mentioned above), but now the production environment works about 50% of the time. This definitely looks like a race condition. Unfortunately, I have no idea whether the problem is in my code, Google's code, or the target server's code.
As others have mentioned, the key differences between dev and prod are the originating IP and how some of the request headers are handled. See here for a list of restricted headers. I don't know if this is documented, but in prod your app ID is appended to the end of your user agent. I once had an issue where requests in prod only were being detected as a search engine spider, because my app ID contained the string "bot".
You mentioned that you're setting up cookies manually, including the session cookie. Does this mean that you established a session in dev and are trying to re-use it in prod? Is it possible that the remote server logs the source IP that establishes a session and requires that subsequent requests come from the same IP?
You said that it doesn't work, but you don't get an exception. What exactly does this mean? Do you get an HTTP 200 and an empty response body? Another HTTP status? Your best bet may be to contact the owners of the remote service and see if they can tell you more specifically what was wrong with your request. Anything else is just speculation.
Check your server's logs to see if GAE is chopping any headers off. I've noticed that GAE (though I think I've only seen it on the dev server) will chop off headers it doesn't like.
Depending on the web service you're calling, it might also be less OK with GAE calling it than with your local machine.
I ran across this while making a webapp with an analogous issue. Looking at urlfetch's documentation, it turns out that the maximum timeout for a fetch call is 60 seconds, but it defaults to 5 seconds.
5 seconds on my local machine was long enough to request the URLs, but on GAE the fetch only consistently completed within 5 seconds about 20% of the time.
I included the parameter deadline=60 and it has been working fine since.
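A sketch of that fix, assuming the Python App Engine runtime where urlfetch lives; the URL is a placeholder:

from google.appengine.api import urlfetch

# Raise the deadline from the 5-second default to the 60-second maximum
result = urlfetch.fetch(url='http://example.com/slow-endpoint', deadline=60)
print(result.status_code)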
Hope this helps others!
