Using Mechanize for python, need to be able to right click - python

My script logs in to my account, navigates the links it needs to, but I need to download an image. This seems to be easy enough to do using urlretrive. The problem is that the src attribute for the image contains a link which points it to the page which initiates a download prompt, and so my only foreseeable option is to right click and select 'save as'. I'm using mechanize and from what I can tell Mechanize doesn't have this functionality. My question is should I switch to something like Selenium?

Mechanize, last I checked, was pretty poorly maintained and documented. Selenium has a much more active community.
That being said: why do you need mechanize to do this? Why not just use urllib?

I would try to watch Chrome's network tab, and try to imitate the final request to get the image. If it turned out to be too difficult, then I would use selenium as you suggested.

Related

How to automatically log in using python

I want to use python to automatically enter a website, login to it and maybe pressing some buttons or something like that(for example entering an online class automatically).
And I also don't want to use selenium or something like that(I don't want to use web drivers).
And if you are suggesting me the "requests" please tell me how should I do it.
thanks a lot for your help.
You cannot use requests or beautifulsoup or urllib for clicking purposes,
Here is the answer for why you can't
I suggest you to use selenium it is easy to use here is the documentation
https://www.selenium.dev/documentation/en/
And you can download chrome driver from here (if you are using chrome):
https://chromedriver.chromium.org/downloads
If you want a detailed explanation on how to install driver and start creating some scripts i found a youtube video which explains it all:
https://www.youtube.com/watch?v=8iAqUVvytJk&ab_channel=TheAmericanDeveloper

I get site blocking and captcha when using Python Selenium webdriver [duplicate]

How can I bypass the Google CAPTCHA using Selenium and Python?
When I try to scrape something, Google give me a CAPTCHA. Can I bypass the Google CAPTCHA with Selenium Python?
As an example, it's Google reCAPTCHA. You can see this CAPTCHA via this link: https://www.google.com/recaptcha/api2/demo
To start with using Selenium's Python clients, you should avoid solving/bypass Google CAPTCHA.
Selenium
Selenium automates browsers. Now, what you want to achieve with that power is entirely up to individuals, but primarily it is for automating web applications through browser clients for testing purposes and of coarse it is certainly not limited to that.
CAPTCHA
On the other hand, CAPTCHA (the acronym being ...Completely Automated Public Turing test to tell Computers and Humans Apart...) is a type of challenge–response test used in computing to determine if the user is human.
So, Selenium and CAPTCHA serves two completely different purposes and ideally shouldn't be used to achieve any interrelated tasks.
Having said that, reCAPTCHA can easily detect the network traffic and identify your program as a Selenium driven bot.
Generic Solution
However, there are some generic approaches to avoid getting detected while web scraping:
The first and foremost attribute a website can determine your script/program by is through your monitor size. So it is recommended not to use the conventional Viewport.
If you need to send multiple requests to a website, keep on changing the User Agent on each request. Here you can find a detailed discussion on Way to change Google Chrome user agent in Selenium?
To simulate humanlike behavior, you may require to slow down the script execution even beyond WebDriverWait and expected_conditions inducing time.sleep(secs). Here you can find a detailed discussion on How to sleep Selenium WebDriver in Python for milliseconds
This use case
However, in a couple of use cases we were able to interact with the reCAPTCHA using Selenium and you can find more details in the following discussions:
How to click on the reCAPTCHA using Selenium and Java
CSS selector for reCAPTCHA checkbok using Selenium and VBA Excel
Find the reCAPTCHA element and click on it — Python + Selenium
References
You can find a couple of related discussion in:
How can I make a Selenium script undetectable using GeckoDriver and Firefox through Python?
Is there a version of Selenium WebDriver that is not detectable?
tl; dr
How does reCAPTCHA 3 know I'm using Selenium/chromedriver?
In order to bypass the CAPTCHA when scraping Google, you have to manually solve a CAPTCHA and export the cookies Google gives you. Now, every time you open a Selenium WebDriver, make sure you add the cookies you exported. The GOOGLE_ABUSE_EXEMPTION cookie is the one you're looking for, but I would save all cookies just to be on the safe side.
If you want an additional layer of stability in your scrapes, you should export several cookies and have your script randomly select one of them each time you ping Google.
These cookies have a long expiration date so you wouldn't need to get new cookies every day.
For help on saving and loading cookies in Python and Selenium, you should check out this answer: How to save and load cookies using Python + Selenium WebDriver
Clear Browsing History, cached data, cookies and other site data
First Create an Google Account while you are in browser window opened by selenium.
Sign in to your account
wd.get("https://accounts.google.com/signin/v2/identifier?hl=en&passive=true&continue=https%3A%2F%2Fwww.google.com%2F%3Fgws_rd%3Dssl&ec=GAZAmgQ&flowName=GlifWebSignIn&flowEntry=ServiceLogin");
Thread.sleep(2000);
wd.findElement(By.name("identifier")).sendKeys("Email"+Keys.ENTER);
Thread.sleep(3000);
wd.findElement(By.name("password")).sendKeys("Password"+Keys.ENTER);
Thread.sleep(5000);
Then Open any website that uses recaptcha tick on checkmark using this code
String framename=wd.findElement(By.tagName("iframe")).getAttribute("name");
wd.switchTo().frame(framename);
wd.findElement(By.xpath("//span[#id='recaptcha-anchor']")).click();
You won't find any Puzzles or anything.
Bypass as in solve it or bypass as in never get it at all?
To solve it:
sign up with 2captcha, capmonster cloud, deathbycaptcha, etc. and follow their instructions. They will give you a token that you pass with the form.
To never get it at all:
Make sure you have good IP reputation (most important for Cloudflare).
Make sure you have a good browser fingerprint (most important for Distil) - I recommend puppeteer + the stealth plugin.
Ok, so there is a simple python script to solve captcha for you.
It basically read the audio and then use google assistant to convert it into text and paste it.
It is only workable in audio captchas which is given the most case with imahe captcha V2
https://github.com/ohyicong/recaptcha_v2_solver
Disclaimer!
I do not write the script, i just get an idea of doing this but got this brother project so, thought to help others through this.
The simple solution is suspend the program for 10 seconds or more and then when the automated browser opens solve the reCAPTCHA on your own and then the program starts after 10 seconds and execute rest of the program like clicking submit button or other things

How can I bypass the Google CAPTCHA with Selenium and Python?

How can I bypass the Google CAPTCHA using Selenium and Python?
When I try to scrape something, Google give me a CAPTCHA. Can I bypass the Google CAPTCHA with Selenium Python?
As an example, it's Google reCAPTCHA. You can see this CAPTCHA via this link: https://www.google.com/recaptcha/api2/demo
To start with using Selenium's Python clients, you should avoid solving/bypass Google CAPTCHA.
Selenium
Selenium automates browsers. Now, what you want to achieve with that power is entirely up to individuals, but primarily it is for automating web applications through browser clients for testing purposes and of coarse it is certainly not limited to that.
CAPTCHA
On the other hand, CAPTCHA (the acronym being ...Completely Automated Public Turing test to tell Computers and Humans Apart...) is a type of challenge–response test used in computing to determine if the user is human.
So, Selenium and CAPTCHA serves two completely different purposes and ideally shouldn't be used to achieve any interrelated tasks.
Having said that, reCAPTCHA can easily detect the network traffic and identify your program as a Selenium driven bot.
Generic Solution
However, there are some generic approaches to avoid getting detected while web scraping:
The first and foremost attribute a website can determine your script/program by is through your monitor size. So it is recommended not to use the conventional Viewport.
If you need to send multiple requests to a website, keep on changing the User Agent on each request. Here you can find a detailed discussion on Way to change Google Chrome user agent in Selenium?
To simulate humanlike behavior, you may require to slow down the script execution even beyond WebDriverWait and expected_conditions inducing time.sleep(secs). Here you can find a detailed discussion on How to sleep Selenium WebDriver in Python for milliseconds
This use case
However, in a couple of use cases we were able to interact with the reCAPTCHA using Selenium and you can find more details in the following discussions:
How to click on the reCAPTCHA using Selenium and Java
CSS selector for reCAPTCHA checkbok using Selenium and VBA Excel
Find the reCAPTCHA element and click on it — Python + Selenium
References
You can find a couple of related discussion in:
How can I make a Selenium script undetectable using GeckoDriver and Firefox through Python?
Is there a version of Selenium WebDriver that is not detectable?
tl; dr
How does reCAPTCHA 3 know I'm using Selenium/chromedriver?
In order to bypass the CAPTCHA when scraping Google, you have to manually solve a CAPTCHA and export the cookies Google gives you. Now, every time you open a Selenium WebDriver, make sure you add the cookies you exported. The GOOGLE_ABUSE_EXEMPTION cookie is the one you're looking for, but I would save all cookies just to be on the safe side.
If you want an additional layer of stability in your scrapes, you should export several cookies and have your script randomly select one of them each time you ping Google.
These cookies have a long expiration date so you wouldn't need to get new cookies every day.
For help on saving and loading cookies in Python and Selenium, you should check out this answer: How to save and load cookies using Python + Selenium WebDriver
Clear Browsing History, cached data, cookies and other site data
First Create an Google Account while you are in browser window opened by selenium.
Sign in to your account
wd.get("https://accounts.google.com/signin/v2/identifier?hl=en&passive=true&continue=https%3A%2F%2Fwww.google.com%2F%3Fgws_rd%3Dssl&ec=GAZAmgQ&flowName=GlifWebSignIn&flowEntry=ServiceLogin");
Thread.sleep(2000);
wd.findElement(By.name("identifier")).sendKeys("Email"+Keys.ENTER);
Thread.sleep(3000);
wd.findElement(By.name("password")).sendKeys("Password"+Keys.ENTER);
Thread.sleep(5000);
Then Open any website that uses recaptcha tick on checkmark using this code
String framename=wd.findElement(By.tagName("iframe")).getAttribute("name");
wd.switchTo().frame(framename);
wd.findElement(By.xpath("//span[#id='recaptcha-anchor']")).click();
You won't find any Puzzles or anything.
Bypass as in solve it or bypass as in never get it at all?
To solve it:
sign up with 2captcha, capmonster cloud, deathbycaptcha, etc. and follow their instructions. They will give you a token that you pass with the form.
To never get it at all:
Make sure you have good IP reputation (most important for Cloudflare).
Make sure you have a good browser fingerprint (most important for Distil) - I recommend puppeteer + the stealth plugin.
Ok, so there is a simple python script to solve captcha for you.
It basically read the audio and then use google assistant to convert it into text and paste it.
It is only workable in audio captchas which is given the most case with imahe captcha V2
https://github.com/ohyicong/recaptcha_v2_solver
Disclaimer!
I do not write the script, i just get an idea of doing this but got this brother project so, thought to help others through this.
The simple solution is suspend the program for 10 seconds or more and then when the automated browser opens solve the reCAPTCHA on your own and then the program starts after 10 seconds and execute rest of the program like clicking submit button or other things

Using Python's Requests library to navigate webpages / Click buttons

I'm new to web programming, and have recently began looking into using Python to automate some manual processes. What I'm trying to do is log into a site, click some drop-down menus to select settings, and run a report.
I've found the acclaimed requests library: http://docs.python-requests.org/en/latest/user/advanced/#request-and-response-objects
and have been trying to figure out how to use it.
I've successfully logged in using bpbp's answer on this page: How to use Python to login to a webpage and retrieve cookies for later usage?
My understanding of "clicking" a button is to write a post() command that mimics a click: Python - clicking a javascript button
My question (since I'm new to web programming and this library) is how I would go about pulling the data I need to figure out how I would construct these commands. I've been looking into [RequestObject].headers, .text, etc. Any examples would be great.
As always, thanks for your help!
EDIT:::
To make this question more concrete, I'm having trouble interacting with different aspects of a web-page. The following image shows what I'm actually trying to do:
I'm on a web-page that looks like this. There is a drop-down menu with click-able dates that can be changed. My goal is to automate changing the date to the most recent date, "click"'Save and Run', and download the report when it's finished running.
The only solution to this I have found is Selenium. If it werent a javascript heavy website you could try mechanize but for this you need to render the javascript and then inject javascript...like Selenium does.
Upside: You can record actions in Firefox (using selenium) then export those actions to python. The downside is that this code has to open a browser window to run.

Advanced screen-scraping using curl

I need to create a script that will log into an authenticated page and download a pdf.
However, the pdf that I need to download is not at a URL, but it is generated upon clicking on a specific input button on the page. When I check the HTML source, it only gives me the url of the button graphic and some obscure name of the button input, and action=".".
In addition, both the url where the button is and the form name is obscured, for example:
url = /WebObjects/MyStore.woa/wo/5.2.0.5.7.3
input name = 0.0.5.7.1.1.11.19.1.13.13.1.1
How would I log into the page, 'click' that button, and download the pdf file within a script?
Maybe Mechanize module can help.
I think that url on clicking the button maybe generated using javascript.So, to run javascript code from python script take a look at Spidermonkey.
Try mechanize or twill. HttpFox or firebug can help you to build your queries. Remember you can also pickle cookies from browser and use it later with py libs. If the code is generated by javascript it could be possible to 'reverse engineer' it. If nof you can run some javascript interpret or use selenium or windmill to script a real browser.
You could observe what requests are made when you click the button (using Firebug in Firefox or Developer Tools in Chrome). You may then be able to request the PDF directly.
It's difficult to help without seeing the page in question.
As Acorn said, you should try monitoring the actual requests and see if you can spot a pattern.
If not, then your best bet is actually to automate a fully-featured browser, that will be able to run Javascript, so you'll exactly mimic what a regular user would do. Have a look at this page on the Python Wiki for ideas, check the section Python Wrappers around Web "Libraries" and Browser Technology.

Categories

Resources