Python Selenium: Firefox set_preference to overwrite files on download? - python

I am using these Firefox preference setting for selenium in Python 2.7:
ff_profile = webdriver.FirefoxProfile(profile_dir)
ff_profile.set_preference("browser.download.folderList", 2)
ff_profile.set_preference("browser.download.manager.showWhenStarting", False)
ff_profile.set_preference("browser.download.dir", dl_dir)
ff_profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
With Selenium, I want to recurringly download the same file, and overwrite it, thus keeping the same filename – without me having to confirm the download.
With the settings above, it will download without asking for location, but all downloads will creates duplicates with the filename filename (1).ext, filename (2).ext etc in MacOS.
I'm guessing there might not be a setting to allow overwriting from within Firefox, to prevent accidents(?).
(In that case, I suppose the solution would be to handle the overwriting on the disk with other Python modules; another topic).

This is something that is out of the Selenium's scope and is handled by the operating system.
Judging by the context of this and your previous question, you know (or can determine from the link text) the filename beforehand. If this is really the case, before hitting the "download" link, make sure you remove the existing file:
import os
filename = "All-tradable-ETFs-ETCs-and-ETNs.xlsx" # or extract it dynamically from the link
filepath = os.path.join(dl_dir, filename)
if os.path.exists(filepath):
os.remove(filepath)

Related

How do I tell a .py script to look for folders and file in the directory it is in?

I have a .py script that pulls in data from Google Sheets and outputs it in a yaml format. This is for my Hugo powered website which is served via Netlify. As I understand it, Netlify is capable of running Python too so I thought I could upload the web content and the python file in the same directory. This is required for updating the content dynamically, and I expect the python file to run everytime I trigger a build for the website. However, the python file requires certain credentials to work.
My code currently looks like this:
# Set location to write new files to.
outputpath = Path("D:/content/submission/")
#Read JSON:
json = Path("D:/credentials.json")
These are hardcoded local paths. When I bundle the script in the website directory, what paths should I write in so that when the script runs, these files are read in and outputted correctly?
I would want to output in my content/submission folder and read in from my creds/credentials.json. Should I just put these paths in? Will it know that it has to look within the directory for these folders, or is there something I need to add to the script that tells it to work within the directory it is sitting in?
🧨 First, credentials and secrets are best kept out of files (and esp, esp, source control).
For general file locations however, you can use something like:
pa_same_but_json = Path(__file__).with_suffix(".json")
pa_same_directory = Path(__file__).parent / "nosecrets.json")
To answer your comments:
(Mind you not 100% sure about Window Drives in the following):
parent
is an attribute on Path objects allowing you to "climb up" hierarchy.
Path("c:\temp\foo.json").parent returns same as Path("c:\temp")
Yes, you can do mypath.parent.parent
/ is a path concatenation operator
when applied to Path objects
So
myfile = os.path.join(["c:", "temp", "foo.json"])
and
myfile_as_a_Path = Path("c:") / "temp" / "foo.json"
are the same, except for one being a string, the other a Path instance. Once the first Path has been built (on C:) the rest of the code "knows that it operating on Path instances" and re-purposes the division operator support (probably some magic __div__ method intended for instance math ) to support path concatenation. This happens because most operations on Path instances return another Path, allowing you to do this type of chaining.
It's best not to write way up the hierarchy in a hosted/VM context (you never know directory structure above or if you have permissions), but something based on your script location might be
pa_current = Path(__file__).parent
# could `content/submission` but that's assuming you're always on Posix
# systems. Letting Pathlib do the work is safer, even if Windows probably
# puts up with `/`
pa_write = pa_current / "content" / "submission"
pa_read = pa_current / "credentials.json"
These at this points are Path instances, but really not much different than strings except having smarter methods to manipulate them. They don't know or care if the files exist or not.
P.S.
🧨 A consideration is that, in many web contexts, writing to code directories (like what happens in a content/submission under the python scripts) is a security goof as well.
Maybe pa_write = pa_current.parent.parent / "uploads" / "content" / "submission" would be better.
Specifically when it comes to user uploads and secrets, please refer to best practices for your platform, not just what Python can do. This answer was about pathlib.Path, not Hugo uploads.

How to open different html file in same window using Python

I have folder which is named as 'Doc', The Doc folder contains some sub folders each folder has '.html' file. I have to open all at time in web breowser using Python code. I opened it but the problem is, the html files is not opening in the same window with new tab. Some times each file is opening in new window. I dont know what is the exact problem. Here, is the code which I tried
import os
import webbrowser
for root, dirs, files in os.walk("Doc"):
for file in files:
if file.endswith("index.html"):
webbrowser.open_new_tab(os.path.join(root, file))
RTFM (https://docs.python.org/2/library/webbrowser.html )
webbrowser.open_new_tab(url)
Open url in a new page (“tab”) of the default browser, if possible, otherwise equivalent to open_new().
It is a best effort only, presumably depending on the browser too.

accessing source code from a local file python

I need to access the source code of a locally saved file, but I need to automate this because there are multiple files in one folder. I've looked at the inspect module and the selenium module, but I still understand what to do. After accessing the source code, I need to use bs4 to extract from it.
I've read several posts on here and elsewhere with similar problems, but the thing is that my file does not open in the source code (it's written in xml and so far everything needs to be in source code before you can use these modules). If I open the file, it just uses my browser to open a regular page and then I have to click view page source.
How can I automate this so that it will open the page, go to the source code, and save it so I can stick it into a soup for later parsing?
path_g_jurt = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt'
file = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt' + "/" + file
for file in path_g_jurt:
if file.endswith(".xhtml"):
with open(file, encoding = "utf-8") as mdata_jurt:
soup = BeautifulSoup(mdata_jurt)
main = file.find("jcid").get_text()
misc_links = []
for item in file.find_all("regelgeving"):
misc = item.find("misc:link")
misc_links.append(misc.get("misc:jcid"))
Any help would be appreciated.

Naming a file when downloading with Selenium Webdriver

I see that you can set where to download a file to through Webdriver, as follows:
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/csv")
browser = webdriver.Firefox(firefox_profile=fp)
But, I was wondering if there is a similar way to give the file a name when it is downloaded? Preferably, probably not something that is associated with the profile, as I will be downloading ~6000 files through one browser instance, and do not want to have to reinitiate the driver for each download.
I would suggest a little bit strange way: do not download files with the use of Selenium if possible.
I mean get the file URL and use urllib library to download the file and save it to disk in a 'manual' way. The issue is that selenium doesn't have a tool to handle Windows dialogs, such as 'save as' dialog. I'm not sure, but I doubt that it can handle any OS dialogs at all, please correct me I'm wrong. :)
Here's a tiny example:
import urllib
urllib.urlretrieve( "http://www.yourhost.com/yourfile.ext", "your-file-name.ext")
The only job for us here is to make sure that we handle all the urllib Exceptions. Please see http://docs.python.org/2/library/urllib.html#urllib.urlretrieve for more info.
I do not know if there is a pure Selenium handler for this, but here is what I have done when I needed to do something with the downloaded file.
Set a loop that polls your download directory for the latest file that does not have a .part extension (this indicates a partial download and would occasionally trip things up if not accounted for. Put a timer on this to ensure that you don't go into an infinite loop in the case of timeout/other error that causes the download not to complete. I used the output of the ls -t <dirname> command in Linux (my old code uses commands, which is deprecated so I won't show it here :) ) and got the first file by using
# result = output of ls -t
result = result.split('\n')[1].split(' ')[-1]
If the while loop exits successfully, the topmost file in the directory will be your file, which you can then modify using os.rename (or anything else you like).
Probably not the answer you were looking for, but hopefully it points you in the right direction.
Solution with code as suggested by the selected answer. Rename the file after each one is downloaded.
import os
os.chdir(SAVE_TO_DIRECTORY)
files = filter(os.path.isfile, os.listdir(SAVE_TO_DIRECTORY))
files = [os.path.join(SAVE_TO_DIRECTORY, f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))
newest_file = files[-1]
os.rename(newest_file, docName + ".pdf")
This answer was posted as an edit to the question naming a file when downloading with Selenium Webdriver by the OP user1253952 under CC BY-SA 3.0.

Python3:Save File to Specified Location

I have a rather simple program that writes HTML code ready for use.
It works fine, except that if one were to run the program from the Python command line, as is the default, the HTML file that is created is created where python.exe is, not where the program I wrote is. And that's a problem.
Do you know a way of getting the .write() function to write a file to a specific location on the disc (e.g. C:\Users\User\Desktop)?
Extra cool-points if you know how to open a file browser window.
The first problem is probably that you are not including the full path when you open the file for writing. For details on opening a web browser, read this fine manual.
import os
target_dir = r"C:\full\path\to\where\you\want\it"
fullname = os.path.join(target_dir,filename)
with open(fullname,"w") as f:
f.write("<html>....</html>")
import webbrowser
url = "file://"+fullname.replace("\\","/")
webbrowser.open(url,True,True)
BTW: the code is the same in python 2.6.
I'll admit I don't know Python 3, so I may be wrong, but in Python 2, you can just check the __file__ variable in your module to get the name of the file it was loaded from. Just create your file in that same directory (preferably using os.path.dirname and os.path.join to remain platform-independent).

Categories

Resources