Scrape website data without BS or selenium (Python)


Basically, my scenario is this: I have a webpage open on my screen and want to copy some of the text from it. There is a whole login process every time, and for security reasons I do not want to log in to the page repeatedly, so requests is not suitable. I also do not want to use Selenium, as it opens a new browser when I want to use my existing one. My question is: with my browser already open on the page I want information from, is there some sort of script I can write that will retrieve certain information from the page and save it somewhere (almost like a macro, but able to retrieve specific elements)? Is this possible?

I'm not sure if I understood the question correctly.
One way might be to save the entire .html file and then process the relevant data "locally".
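As a rough illustration of that "save the page, then process it locally" idea, here is a minimal sketch using only Python's standard library. The file name saved_page.html and the CSS class "price" are made-up placeholders, not anything from the question; the assumption is that you save the page from your already-open browser (Ctrl+S) and adapt the class name to the elements you care about.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    # Void elements never get a closing tag, so they must not affect depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1               # a new match starts here
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_text(html, wanted_class):
    """Return the stripped text of each element that has `wanted_class`."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Hypothetical usage with a page saved from the browser:
# with open("saved_page.html", encoding="utf-8") as f:
#     print(extract_text(f.read(), "price"))
```

This only handles reasonably well-formed HTML; for messy pages a dedicated parser would likely be more robust.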

If you use requests the way you would use Postman, you don't need to log in each time: if you have a valid JWT token, you can skip the login. But that depends on how your site works (your question lacks details).
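To make the token idea concrete, here is a minimal sketch with Python's standard library. It assumes the site accepts a JWT in an "Authorization: Bearer ..." header, which is common but not something the question confirms (many sites use session cookies instead). The URL and token are placeholders.

```python
import urllib.request


def build_authorized_request(url, token):
    """Return a GET request that carries the token as a Bearer credential."""
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )


# Hypothetical usage; the URL and token below are placeholders:
# req = build_authorized_request("https://example.com/api/report", "<your token>")
# with urllib.request.urlopen(req) as resp:
#     body = resp.read().decode("utf-8")
```

The token itself would have to be copied from the logged-in browser session (for example from the Network tab of the developer tools), and it stops working once it expires.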
I don't know about Selenium, but with Puppeteer (a competitor), you can reuse an already installed browser instead of downloading a new one.
Also, do you even need Selenium or Puppeteer? Can't you just run some code in your browser console? You can create and save snippets in the Sources tab in Chrome. If you need direct access to your file system (meaning that having the collected data downloaded automatically to the download folder, or getting a download pop-up to choose the folder, is not enough for you), you could take a look at the Tampermonkey extension. Or maybe you need to write a Chrome extension.
Update after reading your comment, @JeanVanNiekerk:
// Get the user name of the person asking.
console.log(
  document.querySelector('#question .user-details a').innerText
); // 'Jean Van Niekerk'
navigator.clipboard.writeText('stuff').then(
  e => {
    console.log('Copied text ready !');
  }
);
// If you run the code above in the console, you
// will get `Uncaught (in promise) DOMException: Document is not focused.`
// This is a security measure (maybe it can be disabled for your special case; another
// option is to make an extension that has this kind of permission).
// To try it out right now, paste the code below into your console, and swiftly click on the page (anywhere)
setTimeout(() => {
  navigator.clipboard.writeText('stuff').then(
    e => {
      console.log('Copied text ready !');
    }
  );
}, 1000);
// Ctrl+V to paste your text :)

Related

How to save a webpage by setting preferences in python for webdriver?

I am currently trying to save a webpage as it appears on the website, in HTML format. My approach is to press Ctrl + S using AutoIt. When I do, the Save As dialog box appears and asks me for the name of the file to save. This works fine. However, I want Ctrl + S to save the file directly instead of bringing up the dialog box. I read somewhere that this can be done using "set_preference". Can someone suggest how to set such a preference? Below is the code I am using for the Chrome browser:
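from selenium import webdriver
import autoit  # PyAutoIt

driver = webdriver.Chrome()
driver.get('http://www.yahoo.com/')
# Press Ctrl+S to open the Save As dialog
autoit.send("{CTRL down}")
autoit.send("s")
autoit.send("{CTRL up}")
# Type the target path into the dialog and confirm
autoit.send("C:\\Users\\karanjuneja\\Downloads\\kj\\ABCD.mhtml")
autoit.send("{ENTER}")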
Currently I am using the above code; however, I want Ctrl + S to save the file in the desired location without the dialog.
Thanks
Karan

Selenium isn't designed for this. You could either:
Use getHtmlSource and parse the resulting HTML for references to external files, which you can then download and store outside of Selenium.
Use something other than Selenium to download and store an offline version of a website - I'm sure there are plenty of tools that could do this if you do a search. For example WGet can perform a recursive download (http://en.wikipedia.org/wiki/Wget#Recursive_download)
Is there any reason you want to use Selenium? Is this part of your testing strategy or are you just wanting to find a tool that will create an offline copy of a page?
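As a sketch of the first option in Python: the WebDriver bindings expose the current HTML as driver.page_source (the counterpart of getHtmlSource in the older Selenium API), so the page can be written to disk without any Save As dialog. The helper name save_page_source and the output path are made up for illustration. Note that this stores only the HTML, not the images, stylesheets or scripts it references.

```python
from pathlib import Path


def save_page_source(driver, destination):
    """Write the HTML currently loaded in `driver` to `destination`.

    `driver` is any object with a `page_source` attribute, such as a
    Selenium WebDriver instance.
    """
    path = Path(destination)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(driver.page_source, encoding="utf-8")
    return path


# Hypothetical usage (requires selenium and a browser driver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("http://www.yahoo.com/")
# save_page_source(driver, "downloads/yahoo.html")
# driver.quit()
```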

Interacting with pop-up boxes using selenium in python

I'm trying to use the Selenium module in python to generate a text list from one website, save it in a directory, and browse to that text list on another site to submit it.
I'm working on the script in two parts: 1. Get metadata and 2. Order data. I've successfully completed part 1 of the script, except for the very last thing: choosing to save the metadata file that was just generated. I left it alone to work on part 2, hoping I would stumble upon the answer, but I'm just hitting the same problem when the pop-up to choose a file comes along.
In the documentation, I'm told that Selenium WebDriver has built-in support for handling popup dialog boxes and that after triggering a dialog box, if I call alert = driver.switch_to_alert() then I can "accept, dismiss, read its contents, or even type into a prompt."
However, it's not working. When I try alert.text('some text') or alert.send_keys(Keys.TAB), I keep getting the error NoAlertPresentException: Message: No alert is present and after adding the command to wait, I get the error TimeoutException: Message:
Are the popups I'm getting (screenshots attached) not recognized by Selenium? If so, how do I interact with them? It seems like using this to save and/or upload files is something that many people have to do, but I cannot find anything on Google. Specifically, I would like to choose 'Save File' then 'OK' for the first image and for the second I would like to browse to the file (i.e. enter the path into the file name field) and click 'Open.' I don't want to just change my Firefox settings to automatically save because this will eventually be run in a different environment, and that won't help solve my second problem.
Thanks!
EDIT:
I'm testing my script on windows but it will eventually be implemented on a linux cloud server. I thought I was going to have to switch to PhantomJS webdriver (which was probably going to make my problem worse) to do headless browsing but I found a way to keep firefox. I guess all this means is that I can't use AutoIT to fix my problem.

The popups you see are not regular popups that can be interacted with using switch_to. These popups are system dialogs and cannot be automated using selenium.
Usually people avoid having these dialogs shown in the first place by tweaking browser preferences, e.g.:
downloading file using selenium
Access to file download dialog in Firefox
How to download a file using Selenium's WebDriver?
For uploading, usually you can find the appropriate input element and send keys to it with a path to the file:
How to upload file ( picture ) with selenium, python
How to upload files into file inputs? (python-selenium docs)
Let me know if your case cannot be solved by using the answers in the links I've attached.
As for your first, "download file automatically" problem, you just need to set a correct content-type:
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "application/xml,text/xml")
Second problem fix (upload part):
driver.find_element_by_name("input_product_list").send_keys(textpath)
driver.find_element_by_name('include_sr').click()
driver.find_element_by_id('submit').click()

An extremely simple implementation using AutoIt.
The script and steps below can help you click 'Save' > 'OK' on this Windows pop-up.
Step 1: Download the AutoIt package/tool here: AutoIt. You may select the ZIP format (extract it).
Step 2: Open any text editor (say, Notepad), copy the code below, and save it with the extension .au3
(e.g. file.au3)
WinWait("[TITLE:Opening ; CLASS:MozillaDialogClass]","", 10)
If WinExists("[TITLE:Opening ; CLASS:MozillaDialogClass]") Then
    WinActivate("[TITLE:Opening ; CLASS:MozillaDialogClass]")
    Send("{DOWN}")
    Sleep(20)
    Send("{TAB}")
    Sleep(20)
    Send("{TAB}")
    Sleep(20)
    Send("{ENTER}")
EndIf
Step 3: In the extracted ZIP (Step 1), look for the folder named Aut2Exe and open it.
Step 4: Run Aut2exe_x64.exe if your OS is 64-bit; otherwise run Aut2exe.exe.
Step 5: Browse to the file created in Step 2 (the file saved with the .au3 extension),
then choose the destination (.exe/.a3x), select the .exe option (say, file.exe),
and click Convert.
Step 6: Include this file.exe in your project folder and use it as required with the Java code below (as used in Eclipse):
driver.download().click(); // it can be something else as per your flow
Runtime.getRuntime().exec("C:/*path_to_your_EXE_file(selected in step 6))*/file.exe");

Define download directory for chromedriver selenium with python

Everything is in the title!
Is there a way to define the download directory for selenium-chromedriver used with python?
Despite a lot of research, I haven't found anything conclusive...
As a newbie, I've seen many things about "the desired_capabilities" or "the options" for Chromedriver, but nothing has resolved my problem... (and I still don't know if they will!)
To explain my issue a little more:
I have a lot of URLs to scan (200,000), and for each URL a file to download.
I have to create a table with the URL, the information I scraped from it, AND the name of the file I've just downloaded for each webpage.
Given the volume I have to process, I've created threads that open multiple instances of chromedriver to speed up the processing.
The problem is that every downloaded file arrives in the same default directory, and I'm no longer able to link a file to a URL...
So the idea is to create a download directory for every thread, to manage them one by one.
If someone has the answer to the question in the title OR a workaround to identify the downloaded file and link it to the current URL, I will be grateful!

For chromedriver1 create a new profile, and inside that profile set download.default_directory to the desired location, and set this profile for chrome using chrome.profile. The selenium-chromedriver package should have some methods for creating new profiles (at least it does with ruby), as they need some special handling.
Chromedriver2 doesn't support setting the profile. You can set preferences with it. If you want to set the download directory this is how you do it:
prefs: { download: { default_directory: "/tmp" } }
The ruby selenium-webdriver doesn't support this feature yet, the python variant might do however.
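For the Python variant, a minimal sketch of the same preference, with one download directory per worker thread as the question proposes. It assumes a Selenium version whose ChromeOptions has add_experimental_option. The helper names are invented for this example, and the extra "download.prompt_for_download" preference is an addition of mine, not part of the answer above.

```python
import tempfile


def build_download_prefs(download_dir):
    """Chrome preferences that send downloads to `download_dir`."""
    return {
        "download.default_directory": download_dir,
        # Assumption: also suppress the "Save As" prompt.
        "download.prompt_for_download": False,
    }


def new_worker_download_dir():
    """Create a fresh, empty download directory for one worker thread."""
    return tempfile.mkdtemp(prefix="chrome-dl-")


def make_chrome_options(download_dir):
    """Build ChromeOptions carrying the download preferences."""
    from selenium import webdriver  # imported lazily; requires selenium

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", build_download_prefs(download_dir))
    return options


# Hypothetical usage inside each thread:
# download_dir = new_worker_download_dir()
# driver = webdriver.Chrome(options=make_chrome_options(download_dir))
```

Since every thread then has its own directory, any file that appears there can be attributed to the URL that thread is currently processing.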

I recently faced the same issue. I tried a lot of solutions found on the Internet, but none of them helped. So finally I came to this:
Launch chrome with empty user-data-dir (in /tmp folder) to let chrome initialize it
Quit chrome
Modify Default/Preferences in newly created user-data-dir, add those fields to the root object (just an example):
"download": {
"default_directory": "/tmp/tmpX7EADC.downloads",
"directory_upgrade": true
}
Launch chrome again with the same user-data-dir
Now it works just fine.
Another tip: If you don't know the name of the file that is going to be downloaded, take a snapshot (list of files) of the downloads directory, then download the file and find its name by comparing the snapshot with the current list of files in the downloads directory.
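The snapshot tip can be sketched in Python roughly as follows. The function name, timeout values and the list of "in progress" suffixes are assumptions for illustration: Chrome typically writes unfinished downloads with a .crdownload suffix, so those are skipped until the final file shows up.

```python
import os
import time


def wait_for_new_file(directory, before, timeout=30.0, poll=0.5):
    """Return the name of a finished file in `directory` not listed in `before`.

    `before` is a set of file names captured before the download started.
    Raises TimeoutError if nothing new appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        new_names = set(os.listdir(directory)) - before
        # Ignore partial downloads that are still being written.
        finished = sorted(
            name for name in new_names
            if not name.endswith((".crdownload", ".tmp"))
        )
        if finished:
            return finished[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new file appeared in {directory}")
        time.sleep(poll)


# Hypothetical usage around a download click:
# before = set(os.listdir(download_dir))
# download_link.click()
# filename = wait_for_new_file(download_dir, before)
```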

Please try the below code....
System.setProperty("webdriver.chrome.driver", "/path/to/chromedriver");
String downloadFilepath = "/path/to/download";
// Chrome preferences: no pop-ups, custom download directory
HashMap<String, Object> chromePrefs = new HashMap<String, Object>();
chromePrefs.put("profile.default_content_settings.popups", 0);
chromePrefs.put("download.default_directory", downloadFilepath);
ChromeOptions options = new ChromeOptions();
options.setExperimentalOption("prefs", chromePrefs);
options.addArguments("--test-type");
DesiredCapabilities cap = DesiredCapabilities.chrome();
cap.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
// Attach the options (including the prefs) to the capabilities
cap.setCapability(ChromeOptions.CAPABILITY, options);
WebDriver driver = new ChromeDriver(cap);

How to determine the size of html table in pixels given an html file

I have an HTML file that contains various HTML tags, including a bunch of tables. I am processing this file using Python. How do I find out the size (length x width in pixels) of each table when it is rendered by a browser (preferably Chrome or Firefox)?
I am essentially looking for the information when you do "inspect element" on a browser, and you are able to see the size of the various elements. I want to access this size in my python code.
I am using lxml to parse my html and can use selenium if needed.
edit: added the node.js tag in case I can use it to print the size of all the tables from a shell script and grab the output in Python.

You're going to want to use Selenium WebDriver to open the HTML file in an actual browser installed on the computer that your Python code is running on.
I'm not sure how you'd use the Selenium WebDriver API to find out how tall a rendered table is, but the value_of_css_property method might do it.
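For what it's worth, the Python bindings also expose the rendered size of an element directly through its size attribute (a dict with "width" and "height"), so a sketch along these lines may be enough. The function name and the file name report.html are placeholders; the string "tag name" is the locator strategy that By.TAG_NAME stands for.

```python
def rendered_table_sizes(driver):
    """Return (width, height) in pixels for every <table> in the loaded page."""
    tables = driver.find_elements("tag name", "table")  # same as By.TAG_NAME
    return [(t.size["width"], t.size["height"]) for t in tables]


# Hypothetical usage (requires selenium and a browser driver):
# from pathlib import Path
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get(Path("report.html").resolve().as_uri())  # open the local file
# print(rendered_table_sizes(driver))
# driver.quit()
```

The numbers depend on the browser window size and the fonts available, so they match "inspect element" only for the same browser and viewport.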

If you can call out to a shell script, and you can use Node.js, I'm assuming you could also install and use PhantomJS, which is a headless WebKit port (i.e. an actual honest-to-goodness WebKit renderer that just doesn't require a window to work). This will let you use JavaScript and the familiar web libraries to manipulate the document. As an example, the following gets you the width of the logo element towards the upper left of the Stack Overflow site:
page = require('webpage').create(); // create a new "browser"
page.open('http://stackoverflow.com/', function() {
    // callback when loading completes
    var logoWidth = page.evaluate(function() {
        // This runs in the rendered page and uses the version of jQuery that SO loads.
        return $('#hlogo').width();
    });
    console.log(logoWidth); // prints 250, the same as Chrome.
    phantom.exit(); // for some reason you need to exit manually
});
The documentation for PhantomJS will tell you more about what you can do with it and how.
One caveat however is that loading a page takes a while, since it needs to fetch CSS and scripts and generally do everything a browser does. I'm not sure if and how PhantomJS does any caching, if it does it might make sense to reuse the same process for multiple scrapes of the same site.

Disable firefox save as dialog-selenium

I am web scraping with Selenium, and whenever I try to download a file the Firefox download/save-as dialog pops up. Even if I apply profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "application/csv"), it still doesn't work. I have tried every .csv-related MIME type, but none of them works. Is it possible to either click the save-as radio button and then click OK on the dialog, or to disable the dialog entirely?

You should do two things. First, set these three preferences as follows (this is in Java, but I guess you can manage to translate it to Python :-):
profile.setPreference("browser.download.dir", "c:/yourDownloadDir");
profile.setPreference("browser.download.folderList", 2);
profile.setPreference("browser.helperApps.neverAsk.saveToDisk", "application/csv, text/csv");
Second, you should make sure the downloaded file really has the MIME type you listed. To check that, you can use the web developer tools and inspect the download.
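A possible Python translation of the three preferences above, written as a sketch. The helper names are invented for this example; the only Selenium call it relies on is set_preference(name, value), which the question already uses on its profile object and which Selenium's Firefox Options also provides.

```python
def firefox_download_prefs(download_dir, mime_types):
    """Preferences that make Firefox save the given MIME types silently."""
    return {
        "browser.download.dir": download_dir,
        "browser.download.folderList": 2,  # 2 = use the custom directory above
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
    }


def apply_prefs(target, prefs):
    """Copy each preference onto an object that offers set_preference()."""
    for name, value in prefs.items():
        target.set_preference(name, value)
    return target


# Hypothetical usage (requires selenium and geckodriver):
# from selenium import webdriver
# options = webdriver.FirefoxOptions()
# apply_prefs(options, firefox_download_prefs("c:/yourDownloadDir",
#                                             ["application/csv", "text/csv"]))
# driver = webdriver.Firefox(options=options)
```

The MIME type list still has to contain whatever Content-Type the server actually sends for the file, otherwise the dialog keeps appearing.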
EDIT:
To find out the MIME type open Chrome, press Ctrl+Shift+I (Cmd+Alt+I on Mac OS) change to the 'Network' tab and click your download link. You should see something like this:

Just an additional answer that might help someone, as comments to the accepted answer put me on the right track (thanks!). Another MIME type of CSV you might be dealing with is application/x-csv - that was my case and once I looked it up in the Network tab of the browser, I became a happier man :)

In C#
FirefoxOptions options = new FirefoxOptions();
options.SetPreference("browser.download.folderList", 2);
options.SetPreference("browser.download.manager.showWhenStarting", false);
options.SetPreference("browser.download.dir", "c:\\temp");
options.SetPreference("browser.helperApps.neverAsk.saveToDisk", "text/csv");
