I am trying to scrape a dynamic-content (JavaScript) page with Python + Selenium + BS4, and the page blocks my requests at random (the software might be F5 AMS).
I managed to get past the block by changing the user agent for each of the browsers I have specified. The thing is, only the Chrome driver gets past the rejection. The same code, adjusted for the PhantomJS or Firefox drivers, is blocked constantly, as if I were not changing the user agent at all.
I should add that I am also multithreading, meaning I start 4 browsers at the same time.
Why does this happen? What does the Chrome WebDriver offer that lets it get past the firewall when the rest don't?
I really need an answer because I want to switch to Firefox, so I want Firefox to pass just as Chrome does.
Two words: browser fingerprinting. It's a huge topic in its own right and, as Tarun mentioned, it would take a decent amount of research to nail this issue on the head. But it is possible, I believe.
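As a starting point for that research, it can help to check what the page's own JavaScript sees in each browser. The sketch below is only a diagnostic, not a fix: it assumes `driver` is any Selenium WebDriver instance, and the helper names `fingerprint_hints` and `user_agent_applied` are made up for illustration. It reads a few commonly fingerprinted values, including `navigator.webdriver`.

```python
# JavaScript evaluated inside the page. navigator.webdriver is a standard
# property that automation-controlled browsers typically report as true.
FINGERPRINT_JS = """
return {
    userAgent: navigator.userAgent,
    webdriver: navigator.webdriver === true,
    width: window.innerWidth,
    height: window.innerHeight
};
"""


def fingerprint_hints(driver):
    """Return a dict of what the page's JavaScript can see about this browser."""
    return driver.execute_script(FINGERPRINT_JS)


def user_agent_applied(driver, expected_user_agent):
    """True if the overridden user agent is really what the page observes."""
    return fingerprint_hints(driver)["userAgent"] == expected_user_agent
```

Running `fingerprint_hints` in Chrome and in Firefox and comparing the two dicts shows whether the user-agent override took effect in both, and which other values differ between the drivers.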
Long story short, all I am trying to do is scrape the contents of a certain page. Unfortunately, the specific info I need on that page is within an iframe. I have tried several headless browser options, all yielding the same response, which is HTML displaying:
<iframe>Your browser does not support iframe</iframe>
In Python I have tried both Selenium (even tried the --web-security=no & --disable-web-security flags) & PhantomJS (so I know it's not JavaScript related), and in NodeJS I've tried Puppeteer, all of which aren't working...
Is there anything else out there I can try that may work?
Also, no, a direct GET request is useless: the page detects that it's not a real user and loads nothing at all, regardless of user agent and so on. So I really need a browser solution, preferably one that can run headless.
Are there any alternatives to Selenium that don't require a web driver or browser to operate? I recently moved my code over to a Google Cloud VM instance, and when I run it there are multiple errors. I've been trying to get it to work for hours but just can't (no luck with PhantomJS, Chrome and GeckoDriver; I tried re-downloading the browsers, editing the sources.list file, etc.).
The page I'm scraping uses JavaScript to load in numbers, which is why I initially chose Selenium. Everything else works perfectly though!
You could simply use the requests library.
https://requests.readthedocs.io/en/master/
https://anaconda.org/anaconda/requests
You would then need to send a GET or POST request to the server.
If you do not know how to generate a proper POST request, simply try to "record" it.
If you have Chrome, go to the page you want to navigate, press F12, open the "Network" tab and type method:POST into the filter.
Further info here:
https://stackoverflow.com/a/39661536/11971785
At first it is a bit more confusing than Selenium, but once you understand it, it's way better in my opinion.
Also, the JavaScript values shown on the page can usually be read straight out of the JavaScript code returned by your request.
No web driver or anything else is required, and it's a lot more stable and customizable.
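To illustrate reading values out of the returned JavaScript, here is a minimal sketch using only the standard library. The variable name `pageData`, the sample HTML and the helper `extract_js_object` are all made up for the example; in real use the HTML would be the text of the response (for instance `requests.get(url).text`). It only works when the embedded object literal happens to be valid JSON.

```python
import json
import re


def extract_js_object(html, variable_name):
    """Find `var <name> = {...};` in the HTML and parse the object as JSON.

    Returns None if the variable is not found. Assumes the object literal
    is valid JSON (double-quoted keys, no trailing commas).
    """
    pattern = (
        r"(?:var|let|const)\s+"
        + re.escape(variable_name)
        + r"\s*=\s*(\{.*?\})\s*;"
    )
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


# Stand-in for the body of an HTTP response.
sample_html = """
<html><body>
<script>
  var pageData = {"visitors": 1204, "rating": 4.5};
</script>
</body></html>
"""

data = extract_js_object(sample_html, "pageData")
print(data["visitors"])  # prints 1204
```

If the numbers are loaded by a separate XHR call rather than embedded in the page, the request recorded in the Network tab is the one to replay instead.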
How can I bypass the Google CAPTCHA using Selenium and Python?
When I try to scrape something, Google gives me a CAPTCHA. Can I bypass the Google CAPTCHA with Selenium in Python?
As an example, it's Google reCAPTCHA. You can see this CAPTCHA via this link: https://www.google.com/recaptcha/api2/demo
To start with: when using Selenium's Python client, you should avoid trying to solve or bypass Google CAPTCHA.
Selenium
Selenium automates browsers. What you do with that power is entirely up to you, but it is primarily meant for automating web applications through browser clients for testing purposes, and of course it is not limited to that.
CAPTCHA
CAPTCHA, on the other hand (an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart"), is a type of challenge–response test used in computing to determine whether the user is human.
So Selenium and CAPTCHA serve two completely different purposes and ideally shouldn't be used to achieve any interrelated tasks.
Having said that, reCAPTCHA can easily inspect the network traffic and identify your program as a Selenium-driven bot.
Generic Solution
However, there are some generic approaches to avoid getting detected while web scraping:
The first and foremost attribute by which a website can identify your script/program is your monitor size, so it is recommended not to use the conventional viewport.
If you need to send multiple requests to a website, keep changing the user agent on each request. Here you can find a detailed discussion on Way to change Google Chrome user agent in Selenium?
To simulate humanlike behavior, you may need to slow down the script execution even beyond WebDriverWait and expected_conditions by adding time.sleep(secs). Here you can find a detailed discussion on How to sleep Selenium WebDriver in Python for milliseconds
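The user-agent rotation and the randomized pauses from the list above can be sketched with the standard library alone. The user-agent strings are only illustrative examples, and `pick_user_agent` and `human_pause` are hypothetical helper names; how the chosen string is passed to the browser depends on the driver and is not shown here.

```python
import random
import time

# Example strings only; in practice use user agents that match the
# browser build you are actually driving.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def pick_user_agent(rng=random):
    """Choose a user agent at random for the next browser session."""
    return rng.choice(USER_AGENTS)


def human_pause(min_seconds=1.0, max_seconds=3.0, rng=random, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds].

    Returns the delay that was used, which is handy for logging.
    """
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

A call such as `human_pause(0.5, 2.0)` between page interactions replaces a fixed `time.sleep(secs)`, so the timing does not repeat exactly from one action to the next.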
This use case
However, in a couple of use cases we were able to interact with the reCAPTCHA using Selenium and you can find more details in the following discussions:
How to click on the reCAPTCHA using Selenium and Java
CSS selector for reCAPTCHA checkbox using Selenium and VBA Excel
Find the reCAPTCHA element and click on it — Python + Selenium
References
You can find a couple of related discussions in:
How can I make a Selenium script undetectable using GeckoDriver and Firefox through Python?
Is there a version of Selenium WebDriver that is not detectable?
tl; dr
How does reCAPTCHA 3 know I'm using Selenium/chromedriver?
In order to bypass the CAPTCHA when scraping Google, you have to manually solve a CAPTCHA and export the cookies Google gives you. Now, every time you open a Selenium WebDriver, make sure you add the cookies you exported. The GOOGLE_ABUSE_EXEMPTION cookie is the one you're looking for, but I would save all cookies just to be on the safe side.
If you want an additional layer of stability in your scrapes, you should export several cookies and have your script randomly select one of them each time you ping Google.
These cookies have a long expiration date so you wouldn't need to get new cookies every day.
For help on saving and loading cookies in Python and Selenium, you should check out this answer: How to save and load cookies using Python + Selenium WebDriver
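A minimal sketch of the save-and-restore step, assuming `driver` is a Selenium WebDriver (only its `get_cookies()` and `add_cookie()` methods are used). The file path and the helper names `save_cookies` and `load_cookies` are made up for illustration, and JSON is used for storage simply because cookie dicts are plain data.

```python
import json


def save_cookies(driver, path):
    """Write every cookie of the current session to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(driver.get_cookies(), handle)


def load_cookies(driver, path):
    """Add previously saved cookies to the session; return how many were added.

    The browser must already be on a page of the cookies' domain,
    otherwise the driver will refuse to set them.
    """
    with open(path, "r", encoding="utf-8") as handle:
        cookies = json.load(handle)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

Typical use would be to call `save_cookies(driver, "google_cookies.json")` once after solving the CAPTCHA by hand, then in later runs open the site, call `load_cookies(driver, "google_cookies.json")` and reload the page.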
Clear Browsing History, cached data, cookies and other site data
First, create a Google Account while you are in the browser window opened by Selenium.
Sign in to your account
wd.get("https://accounts.google.com/signin/v2/identifier?hl=en&passive=true&continue=https%3A%2F%2Fwww.google.com%2F%3Fgws_rd%3Dssl&ec=GAZAmgQ&flowName=GlifWebSignIn&flowEntry=ServiceLogin");
Thread.sleep(2000);
wd.findElement(By.name("identifier")).sendKeys("Email"+Keys.ENTER);
Thread.sleep(3000);
wd.findElement(By.name("password")).sendKeys("Password"+Keys.ENTER);
Thread.sleep(5000);
Then open any website that uses reCAPTCHA and tick the checkbox using this code:
String framename=wd.findElement(By.tagName("iframe")).getAttribute("name");
wd.switchTo().frame(framename);
wd.findElement(By.xpath("//span[@id='recaptcha-anchor']")).click();
You won't get any puzzles or anything.
Bypass as in solve it or bypass as in never get it at all?
To solve it:
Sign up with 2captcha, capmonster cloud, deathbycaptcha, etc. and follow their instructions. They will give you a token that you pass with the form.
To never get it at all:
Make sure you have good IP reputation (most important for Cloudflare).
Make sure you have a good browser fingerprint (most important for Distil) - I recommend puppeteer + the stealth plugin.
OK, so there is a simple Python script that solves the CAPTCHA for you.
It basically reads the audio challenge, uses Google Assistant to convert it into text, and pastes the result.
It only works with audio CAPTCHAs, which in most cases are offered alongside the image CAPTCHA in reCAPTCHA v2.
https://github.com/ohyicong/recaptcha_v2_solver
Disclaimer!
I did not write the script. I had the idea of doing this and then found this project, so I thought I would share it to help others.
The simple solution is to suspend the program for 10 seconds or more. When the automated browser opens, solve the reCAPTCHA yourself; once the pause is over, the program resumes and executes the rest of its steps, such as clicking the submit button.
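A variation on the fixed 10-second pause is to poll until the manual step is done, with an upper time limit. This is only a sketch: `wait_for_manual_step` is a made-up helper, and `is_done` is a hypothetical callable you would supply yourself (for example, one that checks whether the submit button has become clickable).

```python
import time


def wait_for_manual_step(is_done, timeout=60.0, poll_interval=0.5,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() until it returns True or the timeout expires.

    Returns True if the manual step finished in time, False otherwise.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if is_done():
            return True
        sleep(poll_interval)
    return False
```

Compared with a plain `time.sleep(10)`, this continues as soon as the CAPTCHA is solved and does not carry on blindly if it was not solved in time.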
I've been writing automated tests with Selenium WebDriver 2.45 in Python. To get through some of the things I need to test, I must retrieve the various JSESSION cookies that are generated by the site. When I use WebDriver's get_cookies() function with Firefox or Chrome, all of the needed cookies are returned. When I do the same thing with IE11, I do not see the cookies that I need. Does anyone know how I can retrieve session cookies from IE?
What you describe sounds like an issue I ran into a few months ago. My tests ran fine with Chrome and Firefox but not in IE, and the problem was cookies. Upon investigation what I found is that my web site had set its session cookies to be HTTP-only. When a cookie has this flag turned on, the browser will send the cookie over the HTTP(S) protocol and allow it to be set by the server in responses but it will make the cookie inaccessible to JavaScript. (Which is consistent with your comment that you cannot see the cookies you want in document.cookie.) It so happens that when you use Selenium with Chrome or Firefox, Selenium is able to ignore this flag and obtain the cookies from the browser anyway. However, it cannot do the same with IE.
I worked around this issue by turning off the HTTP-only flag when running my site in testing mode. I use Django for my server so I had to create a special test_settings.py file with SESSION_COOKIE_HTTPONLY = False in it.
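To confirm that HTTP-only cookies are the cause, it may help to compare what each browser returns. The sketch below works on the list of dicts that `get_cookies()` returns; the function name `summarize_cookies`, the cookie names and the sample data are invented for illustration, and it assumes each dict may carry an `httpOnly` key.

```python
def summarize_cookies(cookies, expected_names):
    """Report which expected cookies are absent and which are HTTP-only.

    cookies: list of dicts as returned by driver.get_cookies().
    expected_names: cookie names the test needs.
    """
    seen = {cookie["name"]: cookie for cookie in cookies}
    return {
        "missing": sorted(n for n in expected_names if n not in seen),
        "http_only": sorted(n for n, c in seen.items() if c.get("httpOnly")),
    }


# Invented sample data standing in for two browsers' get_cookies() results.
chrome_cookies = [
    {"name": "JSESSIONID", "value": "abc", "httpOnly": True},
    {"name": "theme", "value": "dark", "httpOnly": False},
]
ie_cookies = [
    {"name": "theme", "value": "dark", "httpOnly": False},
]

print(summarize_cookies(chrome_cookies, ["JSESSIONID", "theme"]))
print(summarize_cookies(ie_cookies, ["JSESSIONID", "theme"]))
```

If the cookies missing in IE are exactly the ones flagged HTTP-only in Chrome or Firefox, that points to the behavior described above.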
There is an open issue with IE and Safari. Those drivers will not return correct cookie information, at least not the domain. See this