I am struggling to find a method in Python that lets me read data from a web browser that is already open. Effectively, I am trying to download a very large table of data from a locally controlled company webpage and load it into a dataframe. The issue is that the website has a fairly complex authentication token process, which I have not been able to get past using Selenium with a slew of webdrivers, Requests, urllib, or cookielib with a variety of user parameters. I have given up on this front entirely, as I am almost positive that there is more to the authentication process than can be achieved easily with these libraries.
However, I did manage to get past the token process when I quickly tested opening a new tab, using webbrowser, in a browser that was already logged in. The webbrowser module does not offer a read function, so even though the page can be opened, the data on the page cannot be read into a pandas dataframe. This got me thinking I could use win32com to open a browser, log in, and then run the rest of the script, but again, the Internet Explorer dispatch object has no general read ability, so I can't send the information I want to pandas. I'm stumped. Any ideas?
I could acquire the necessary authentication token scripts, but I am sure that it would take a week or two before anything would happen on that front. I would obviously prefer to get something in the mean time while I wait for the actual auth scripts from the company.
Update: I received authentication tokens from the company; however, using them requires a Python package on another server that I do not have access to, mostly because it's an oddity that I am using Python in my department. Thus the above still applies: I need a method for reading and manipulating an open browser.
Step-by-step
1) Start browser with Selenium.
2) The script starts waiting for a certain element whose presence tells it that you are logged in and on the required page.
3) You use this new browser window to log in to the page manually.
4) The script detects that you are logged in and on the required page.
5) The script processes the page the way you like.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Start the webdriver (opens Chrome in a new window).
chrome = webdriver.Chrome()
chrome.get("https://your_page_here")
# Initialize a waiter that waits at most 300 seconds.
waiter = WebDriverWait(chrome, 300)
# Wait for the #logout element to appear.
# I assume its presence shows that you are logged in.
# Note: the locator must be passed as a single (By, value) tuple.
waiter.until(EC.presence_of_element_located((By.ID, "logout")))
# Extract data etc.
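For the "extract data" step, one option is to take the rendered HTML from chrome.page_source and turn its table rows into lists. The following is only a minimal sketch using the standard library; TableParser and table_rows are hypothetical names chosen for illustration, and it assumes the data sits in a plain HTML table without nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def _close_cell(self):
        # Store the current cell (if any) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing closing tag
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return all table rows found in an HTML string as lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

With a logged-in driver, rows = table_rows(chrome.page_source) followed by pandas.DataFrame(rows[1:], columns=rows[0]) would give a dataframe, assuming the first row holds the column headers. pandas.read_html should do the same job in one call if an HTML parser such as lxml is installed.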
It might be easier to use your own Chrome user profile. That way your previous session may be continued, so you will not need to log in again.
options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=FULL_PATH__TO_PROFILE")
chrome = webdriver.Chrome(options=options)  # chrome_options= is the old, deprecated keyword
chrome.get("https://your_page_here")
Related
The thing is, I can log in to the site from only one device, which is my browser, so I can't use Selenium. I need something else that opens my browser and copies the text from an element using Python.
I think that you may use one of two solutions:
Use Selenium to open a browser every time, and automate the login process.
If the site lets you in without logging in when you use your default browser (as Stack Overflow does, for example: you don't need to log in every time you open the website from your device), you can use the same browser profile and it should do the same job, logging in to that website automatically.
references:
what's a browser profile?
how to load a chrome's browser profile with python3 and selenium?
How can I bypass the Google CAPTCHA using Selenium and Python?
When I try to scrape something, Google give me a CAPTCHA. Can I bypass the Google CAPTCHA with Selenium Python?
As an example, it's Google reCAPTCHA. You can see this CAPTCHA via this link: https://www.google.com/recaptcha/api2/demo
To start with, when using Selenium's Python client, you should avoid trying to solve or bypass Google's CAPTCHA.
Selenium
Selenium automates browsers. What you do with that power is entirely up to you, but it is primarily for automating web applications through browser clients for testing purposes, and of course it is certainly not limited to that.
CAPTCHA
On the other hand, CAPTCHA (the acronym being ...Completely Automated Public Turing test to tell Computers and Humans Apart...) is a type of challenge–response test used in computing to determine if the user is human.
So, Selenium and CAPTCHA serve two completely different purposes and ideally shouldn't be used to achieve any interrelated tasks.
Having said that, reCAPTCHA can easily detect the network traffic and identify your program as a Selenium driven bot.
Generic Solution
However, there are some generic approaches to avoid getting detected while web scraping:
One of the first attributes a website can use to identify your script/program is the browser window (viewport) size, so it is recommended not to use the default viewport.
If you need to send multiple requests to a website, keep changing the User Agent on each request. Here you can find a detailed discussion on Way to change Google Chrome user agent in Selenium?
To simulate humanlike behavior, you may need to slow the script down beyond what WebDriverWait and expected_conditions provide, by adding time.sleep(secs). Here you can find a detailed discussion on How to sleep Selenium WebDriver in Python for milliseconds
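The three points above can be sketched roughly as follows. This is an illustration under assumptions, not a recipe known to defeat any particular detector: the user agent strings and window sizes are placeholder values, and chrome_arguments and human_pause are hypothetical helper names. The --user-agent and --window-size switches are standard Chrome command-line flags.

```python
import random
import time

# Placeholder values for illustration; replace with ones that suit your case.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]
WINDOW_SIZES = [(1366, 768), (1440, 900), (1536, 864)]


def chrome_arguments(rng=random):
    """Pick a user agent and a window size, returned as Chrome flags."""
    width, height = rng.choice(WINDOW_SIZES)
    return [
        f"--user-agent={rng.choice(USER_AGENTS)}",
        f"--window-size={width},{height}",
    ]


def human_pause(low=1.0, high=3.0, rng=random, sleep=time.sleep):
    """Sleep for a random time between low and high seconds and return it."""
    delay = rng.uniform(low, high)
    sleep(delay)
    return delay
```

Each returned flag would be passed to options.add_argument(flag) on a webdriver.ChromeOptions() object before the driver is created, and human_pause() would be called between page interactions.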
This use case
However, in a couple of use cases we were able to interact with the reCAPTCHA using Selenium and you can find more details in the following discussions:
How to click on the reCAPTCHA using Selenium and Java
CSS selector for reCAPTCHA checkbok using Selenium and VBA Excel
Find the reCAPTCHA element and click on it — Python + Selenium
References
You can find a couple of related discussion in:
How can I make a Selenium script undetectable using GeckoDriver and Firefox through Python?
Is there a version of Selenium WebDriver that is not detectable?
tl; dr
How does reCAPTCHA 3 know I'm using Selenium/chromedriver?
In order to bypass the CAPTCHA when scraping Google, you have to manually solve a CAPTCHA and export the cookies Google gives you. Now, every time you open a Selenium WebDriver, make sure you add the cookies you exported. The GOOGLE_ABUSE_EXEMPTION cookie is the one you're looking for, but I would save all cookies just to be on the safe side.
If you want an additional layer of stability in your scrapes, you should export several cookies and have your script randomly select one of them each time you ping Google.
These cookies have a long expiration date so you wouldn't need to get new cookies every day.
For help on saving and loading cookies in Python and Selenium, you should check out this answer: How to save and load cookies using Python + Selenium WebDriver
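As a rough sketch of the save-and-reload idea, the helpers below store cookies as JSON. The linked answer may use a different storage format; JSON is my own choice here, and save_cookies and load_cookies are hypothetical names. The driver methods used, get_cookies() and add_cookie(), are part of Selenium's WebDriver API.

```python
import json


def save_cookies(driver, path):
    """Write the current session's cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(driver.get_cookies(), fh)


def load_cookies(driver, path):
    """Add previously saved cookies to the driver; return how many were added.

    The driver should already be on a page of the cookies' domain,
    otherwise the browser rejects them.
    """
    with open(path, encoding="utf-8") as fh:
        cookies = json.load(fh)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

A typical flow would be: solve the CAPTCHA once by hand, call save_cookies(driver, "cookies.json"), and in later runs open the site first, call load_cookies(driver, "cookies.json"), then reload the page.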
Clear the browsing history, cached data, cookies and other site data.
First, create a Google account while you are in the browser window opened by Selenium.
Sign in to your account:
wd.get("https://accounts.google.com/signin/v2/identifier?hl=en&passive=true&continue=https%3A%2F%2Fwww.google.com%2F%3Fgws_rd%3Dssl&ec=GAZAmgQ&flowName=GlifWebSignIn&flowEntry=ServiceLogin");
Thread.sleep(2000);
wd.findElement(By.name("identifier")).sendKeys("Email"+Keys.ENTER);
Thread.sleep(3000);
wd.findElement(By.name("password")).sendKeys("Password"+Keys.ENTER);
Thread.sleep(5000);
Then open any website that uses reCAPTCHA and tick the checkbox using this code:
String framename=wd.findElement(By.tagName("iframe")).getAttribute("name");
wd.switchTo().frame(framename);
wd.findElement(By.xpath("//span[@id='recaptcha-anchor']")).click();
You won't find any Puzzles or anything.
Bypass as in solve it or bypass as in never get it at all?
To solve it:
Sign up with 2captcha, capmonster cloud, deathbycaptcha, etc. and follow their instructions. They will give you a token that you pass with the form.
To never get it at all:
Make sure you have good IP reputation (most important for Cloudflare).
Make sure you have a good browser fingerprint (most important for Distil) - I recommend puppeteer + the stealth plugin.
OK, so there is a simple Python script that solves the CAPTCHA for you.
It basically reads the audio challenge, uses Google's speech recognition to convert it into text, and pastes the text in.
It only works with audio CAPTCHAs, which in most cases are offered alongside the image challenge of reCAPTCHA v2.
https://github.com/ohyicong/recaptcha_v2_solver
Disclaimer!
I did not write the script. I had the idea of doing this myself, but then found this project, so I thought sharing it might help others.
The simple solution is to suspend the program for 10 seconds or more. When the automated browser opens, solve the reCAPTCHA yourself; after the pause, the program resumes and executes the rest of its steps, such as clicking the submit button.
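A fixed sleep can be too short or needlessly long, so a variation on this idea is to poll until the manual step is finished. This is a minimal sketch; wait_for_manual_step is a hypothetical helper, and the condition you pass in is whatever tells you the CAPTCHA has been solved on your particular page.

```python
import time


def wait_for_manual_step(is_done, timeout=120.0, poll=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() until it returns True or timeout seconds pass.

    Returns True if the step completed, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```

For example, is_done could be a lambda that checks driver.current_url for the page you land on after the CAPTCHA. The simplest alternative of all is a plain input("Solve the CAPTCHA, then press Enter") call in the script.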
I am interested in using Selenium with Python to allow multiple bots to play poker against themselves on Pokernow (https://www.pokernow.club). You can create your own poker game and share a link for others to join. I have written a bot using Selenium that creates a game (and is player 1) and instantiated a new webdriver (with the shareable link) for a second bot to join the game. If I use the same webdriver browser (Chrome), however, the site recognizes that the p2 request is coming from the same source as p1 and assumes that p2 is p1. This behavior also occurs if done manually using the same browser, even using incognito mode.
This can be fixed by instantiating the second webdriver with Safari, however I am curious if there is a more elegant solution to allow both webdrivers to use Chrome without the site recognizing that they are requesting from the same source. I would like to have more than two players and I am running out of additional browsers to use.
The site probably recognizes you by cookies. You can try using a new webdriver instance for each player. Every instance uses a new profile, which should make the browsers independent:
driver1 = webdriver.Chrome() # for player 1
driver2 = webdriver.Chrome() # for player 2
You can also use a Selenium Grid hub with Docker to keep the browsers completely separate, or use different browsers.
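If you want to make the separation between players explicit, one possibility is to give each Chrome instance its own freshly created profile directory. This is only a sketch: player_profile_arguments is a hypothetical helper, and whether separate profiles are enough depends on how the site identifies players (if it goes by IP address, for instance, this will not help).

```python
import tempfile


def player_profile_arguments(player_count):
    """Create one fresh user-data directory per player.

    Returns a list with one --user-data-dir Chrome flag per player.
    """
    flags = []
    for number in range(1, player_count + 1):
        # mkdtemp creates a new, empty directory with a unique name.
        profile_dir = tempfile.mkdtemp(prefix=f"player{number}_")
        flags.append(f"--user-data-dir={profile_dir}")
    return flags
```

Each flag would be passed to options.add_argument(flag) on its own webdriver.ChromeOptions() object, and each options object to its own webdriver.Chrome(options=options) call.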
I have this python code, which accesses a website using the module webbrowser:
import webbrowser
webbrowser.open('kahoot.it')
How could I input information into a text box on this website?
I suggest you use Selenium for that matter.
Here is an example code:
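from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# driver = webdriver.Firefox()  # Use this if you prefer Firefox.
driver = webdriver.Chrome()
driver.get('http://www.google.com/')
# find_elements_by_css_selector was removed in Selenium 4; use find_element with By.
# The class names below are specific to Google's markup and may change.
search_input = driver.find_element(By.CSS_SELECTOR, 'input.gLFyf.gsfi')
search_input.send_keys('some search string' + Keys.RETURN)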
You can use Selenium better if you know HTML and CSS well. Knowing Javascript/JQuery may help too.
You need the specific webdriver to run it properly:
GeckoDriver (Firefox)
Chrome
There are other webdrivers available, but one of the previous should be enough for you.
On Windows, you should have the executable in the same folder as your code. On Ubuntu, you should copy the webdriver file to /usr/local/bin/
You can use Selenium not only to input information, but also for many other tasks.
I don't think that's doable with the webbrowser module. I suggest you take a look at Selenium.
How to use Selenium with Python?
Depending on how complex (interactive, reliant on scripts, ...) your activity is, you can use requests or, as others have suggested, selenium.
Requests allows you to send and get basic data from websites. You would probably use it when automatically submitting an order form, querying an API, checking if a page has been updated, ...
Selenium gives you programmatic control of a "normal" browser. This seems better for your specific use case.
The webbrowser module is actually only (more or less) able to open a browser. You can use this if you want to open a link from inside your application.
I have a webdriver using Selenium that opens a browser for me, points it to an IP address, does a bunch of stuff and closes.
I want to know all of the urls accessed during this time. That is, any ads that are loaded, any css calls that were made out to any url and so on.
Here is the code I'm using:
from selenium import webdriver
browser = webdriver.Firefox(profile) # Get local session of firefox
browser.get(url) # Open a url and wait for it to finish
I did it by loading the Firefox plugins Firebug and NetExport. The first is a tool that lets you see all the exchanged requests and responses; the second writes all of it to a file (.har extension). So basically Selenium has to load the plugins and the website and wait for as long as you want, and when the browser closes, you get a file with the result.
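Once you have the .har file, listing the URLs is straightforward, because a HAR file is JSON with the requests stored under log.entries. A minimal sketch follows; the function names are hypothetical, and it assumes the export follows the usual HAR layout.

```python
import json


def urls_from_har(har):
    """Return the request URLs recorded in a parsed HAR document, in order."""
    return [entry["request"]["url"] for entry in har["log"]["entries"]]


def urls_from_har_file(path):
    """Load a .har file from disk and return its request URLs."""
    # utf-8-sig also accepts files that start with a byte order mark.
    with open(path, encoding="utf-8-sig") as fh:
        return urls_from_har(json.load(fh))
```

Calling urls_from_har_file on the exported file gives a list of every URL the page requested, including ads, stylesheets and scripts.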
It's not a Python solution, but you can add the Fiddler plug-in to Firefox. We needed to do exactly the same thing about a year ago. We used Selenium to open the browser and do all the UI work, and in the background Fiddler captured all traffic (HTTP and HTTPS). This also lists all JS and CSS sources, and you can debug later with the inspector to see what request was sent and what response was received.