I see that you can set where to download a file to through Webdriver, as follows:
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/csv")
browser = webdriver.Firefox(firefox_profile=fp)
But I was wondering whether there is a similar way to give the file a name when it is downloaded. Preferably not something tied to the profile, as I will be downloading ~6000 files through one browser instance and do not want to reinitialize the driver for each download.
I would suggest a slightly unusual approach: do not download the files with Selenium at all, if possible.
I mean get the file URL and use the urllib library to download the file and save it to disk in a 'manual' way. The issue is that Selenium doesn't have a tool to handle Windows dialogs, such as the 'save as' dialog. I'm not sure, but I doubt that it can handle any OS dialogs at all; please correct me if I'm wrong. :)
Here's a tiny example:
import urllib
urllib.urlretrieve( "http://www.yourhost.com/yourfile.ext", "your-file-name.ext")
The only job for us here is to make sure that we handle all the urllib Exceptions. Please see http://docs.python.org/2/library/urllib.html#urllib.urlretrieve for more info.
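For readers on Python 3, the same idea can be sketched as follows. In Python 3 the function lives in urllib.request rather than directly in urllib. The helper name download_as is made up for illustration, and the sketch assumes the file can be fetched without the browser's session cookies, which may not hold for downloads behind a login.

```python
import urllib.error
import urllib.request


def download_as(url, dest_path):
    """Download url to dest_path; return True on success, False on a urllib error."""
    try:
        # urlretrieve writes the response body straight to dest_path,
        # so the caller picks the file name.
        urllib.request.urlretrieve(url, dest_path)
    except urllib.error.URLError as exc:
        # HTTPError and ContentTooShortError are subclasses of URLError.
        print("download failed for %s: %s" % (url, exc))
        return False
    return True
```

Called in a loop, something like download_as(file_url, "report-0001.csv") would give each of the ~6000 files its own name without touching the browser profile.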
I do not know if there is a pure Selenium handler for this, but here is what I have done when I needed to do something with the downloaded file.
Set up a loop that polls your download directory for the latest file that does not have a .part extension (this indicates a partial download and would occasionally trip things up if not accounted for). Put a timer on this to ensure that you don't go into an infinite loop in the case of a timeout or other error that causes the download not to complete. I used the output of the ls -t <dirname> command in Linux (my old code uses the commands module, which is deprecated, so I won't show it here :) ) and got the first file by using
# result = output of ls -t
result = result.split('\n')[1].split(' ')[-1]
If the while loop exits successfully, the topmost file in the directory will be your file, which you can then modify using os.rename (or anything else you like).
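A minimal sketch of that polling loop using only the standard library instead of ls. The function name wait_for_download and the known parameter are my own additions: known is a snapshot of the directory taken before the download is triggered, so an older file is not mistaken for the new one. It assumes Firefox's .part naming for partial downloads.

```python
import os
import time


def wait_for_download(directory, known=(), timeout=60.0, poll_interval=0.5):
    """Return the path of the newest finished new file, or None on timeout."""
    known = set(known)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(directory)
        # Any .part file means a download is still in progress.
        partial = [n for n in names if n.endswith(".part")]
        new = [
            os.path.join(directory, n)
            for n in names
            if n not in known
            and not n.endswith(".part")
            and os.path.isfile(os.path.join(directory, n))
        ]
        if new and not partial:
            return max(new, key=os.path.getmtime)
        time.sleep(poll_interval)
    return None
```

A typical use would be to call os.listdir on the download directory, click the link, call wait_for_download with that listing as known, and pass the returned path to os.rename if it is not None.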
Probably not the answer you were looking for, but hopefully it points you in the right direction.
Solution with code as suggested by the selected answer. Rename the file after each one is downloaded.
import os
os.chdir(SAVE_TO_DIRECTORY)
files = filter(os.path.isfile, os.listdir(SAVE_TO_DIRECTORY))
files = [os.path.join(SAVE_TO_DIRECTORY, f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))
newest_file = files[-1]
os.rename(newest_file, docName + ".pdf")
This answer was posted as an edit to the question naming a file when downloading with Selenium Webdriver by the OP user1253952 under CC BY-SA 3.0.
I am trying to rename a file, but python cannot find the file specified.
I have a file located here:
C:\Users\my_username\Desktop\selenium_downloads\close_of_day_reports\close-of-day-2022-04-24-2022-04-23.pdf
I am trying to rename the file to test.pdf
Here is the code I am using:
import os
os.rename(
src = "C:\\Users\\my_username\\Desktop\\selenium_downloads\\close_of_day_reports\\close-of-day-2022-04-24-2022-04-23.pdf",
dst = "C:\\Users\\my_username\\Desktop\\selenium_downloads\\close_of_day_reports\\test.pdf"
)
The error message I am getting is:
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'C:\\Users\\my_username\\Desktop\\selenium_downloads\\close_of_day_reports\\close-of-day-2022-04-24-2022-04-23.pdf' ->
'C:\\Users\\my_username\\Desktop\\selenium_downloads\\close_of_day_reports\\test.pdf'
What am I doing wrong?
Edit #1:
The original file was not deleted, it still exists.
It's really strange, when I run it the first time, the file does not get renamed, but when I run it again, it does.
Weird, for some reason it works in Python Shell, but not my Python file.
Edit #2:
I am using Selenium to download the file. When I comment the part of my code out that downloads the file from Selenium, my os.rename code works fine. Weird.
Based on the error, I think you are still passing the original name as the source, or you are not renaming the right file.
import os
os.rename(
src = "C:\\Users\\my_username\\Desktop\\selenium_downloads\\close_of_day_reports\\test.pdf",
dst = "C:\\Users\\my_username\\Desktop\\selenium_downloads\\close_of_day_reports\\close-of-day-2022-04-24-2022-04-23.pdf"
)
I'm pretty sure you ran the code once, renamed the file, and now it won't run again because you already renamed it.
Careful reading is your friend. Computers don't know or care what you meant:
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'C:\Users\my_usernamer\Desktop\selenium_downloads\close_of_day_reports\close-of-day-2022-04-24-2022-04-23.pdf'
See the stray r in the path?
Found the solution:
When you download files using Selenium, you need to sleep for a few seconds after starting the download; after that you can rename or move the files without a problem.
Put this in your code after downloading the file, before downloading another:
from time import sleep
sleep(10)
pass
You may need to increase the sleep value, but 10 worked for me. The number inside of sleep represents seconds, so sleep(10) means to wait 10 seconds.
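As an alternative to a fixed sleep, here is a hedged sketch that waits only as long as needed. The helper name rename_when_ready is hypothetical, and treating "the size did not change between two polls" as "the download is finished" is a heuristic I am assuming, not something Selenium guarantees.

```python
import os
import time


def rename_when_ready(src, dst, timeout=30.0, poll_interval=0.25):
    """Wait until src exists and its size stops changing, then rename it to dst.

    Returns True if the rename happened, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(src):
            size = os.path.getsize(src)
            if size == last_size:
                # Same size on two consecutive polls: assume the write is done.
                os.replace(src, dst)
                return True
            last_size = size
        time.sleep(poll_interval)
    return False
```

With the paths from the question, this would be called as rename_when_ready(src, dst) right after the step that triggers the download.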
I am using libtorrent for Python 3.6. I just want to get the names of the files downloaded with a session, e.g. the folder name, the file names, etc.
I searched around the web but didn't come across anything. I am using the following example:
https://www.libtorrent.org/python_binding.html
So when the download finishes, I want to know which files this session downloaded. How can I achieve that? Thanks in advance!
Finally found the answer, the code is:
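import time
import libtorrent

# session, magnetLink and params are assumed to be defined earlier
handle = libtorrent.add_magnet_uri(session, magnetLink, params)
session.start_dht()
# wait until the torrent metadata has been fetched
while not handle.has_metadata():
    time.sleep(1)
torinfo = handle.get_torrent_info()
for x in range(torinfo.files().num_files()):
    print(torinfo.files().file_path(x))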
The code above prints the file names that came with the magnet file.
I am using these Firefox preference setting for selenium in Python 2.7:
ff_profile = webdriver.FirefoxProfile(profile_dir)
ff_profile.set_preference("browser.download.folderList", 2)
ff_profile.set_preference("browser.download.manager.showWhenStarting", False)
ff_profile.set_preference("browser.download.dir", dl_dir)
ff_profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
With Selenium, I want to repeatedly download the same file and overwrite it, thus keeping the same filename, without having to confirm the download.
With the settings above, it downloads without asking for a location, but every download creates a duplicate named filename (1).ext, filename (2).ext, etc. on macOS.
I'm guessing there might not be a setting to allow overwriting from within Firefox, to prevent accidents(?).
(In that case, I suppose the solution would be to handle the overwriting on the disk with other Python modules; another topic).
This is outside Selenium's scope and is handled by the operating system.
Judging by the context of this and your previous question, you know (or can determine from the link text) the filename beforehand. If this is really the case, before hitting the "download" link, make sure you remove the existing file:
import os
filename = "All-tradable-ETFs-ETCs-and-ETNs.xlsx" # or extract it dynamically from the link
filepath = os.path.join(dl_dir, filename)
if os.path.exists(filepath):
    os.remove(filepath)
I need to access the source code of a locally saved file, but I need to automate this because there are multiple files in one folder. I've looked at the inspect module and the selenium module, but I still don't understand what to do. After accessing the source code, I need to use bs4 to extract data from it.
I've read several posts on here and elsewhere with similar problems, but the thing is that my file does not open in the source code (it's written in xml and so far everything needs to be in source code before you can use these modules). If I open the file, it just uses my browser to open a regular page and then I have to click view page source.
How can I automate this so that it will open the page, go to the source code, and save it so I can stick it into a soup for later parsing?
path_g_jurt = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt'
file = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt' + "/" + file
for file in path_g_jurt:
    if file.endswith(".xhtml"):
        with open(file, encoding = "utf-8") as mdata_jurt:
            soup = BeautifulSoup(mdata_jurt)
            main = file.find("jcid").get_text()
            misc_links = []
            for item in file.find_all("regelgeving"):
                misc = item.find("misc:link")
                misc_links.append(misc.get("misc:jcid"))
Any help would be appreciated.
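One possible direction, offered as a sketch rather than a definitive answer: a file saved on disk already is the source, so no browser or "view page source" step should be needed. The loop in the question iterates over the characters of the path string rather than over the files in the folder; os.listdir gives the file names instead. The helper name read_sources is made up for illustration, and UTF-8 encoding is assumed as in the question.

```python
import os


def read_sources(folder, extension=".xhtml"):
    """Yield (path, text) for every file in folder whose name ends with extension."""
    for name in sorted(os.listdir(folder)):
        if name.endswith(extension):
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8") as handle:
                # The raw markup is the same text a browser shows as page source.
                yield path, handle.read()
```

Each text value can then be handed to BeautifulSoup, and find or find_all should be called on the resulting soup object rather than on the file name.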
I have a rather simple program that writes HTML code ready for use.
It works fine, except that if one were to run the program from the Python command line, as is the default, the HTML file that is created is created where python.exe is, not where the program I wrote is. And that's a problem.
Do you know a way of getting the .write() function to write a file to a specific location on the disc (e.g. C:\Users\User\Desktop)?
Extra cool-points if you know how to open a file browser window.
The first problem is probably that you are not including the full path when you open the file for writing. For details on opening a web browser, read this fine manual.
import os
target_dir = r"C:\full\path\to\where\you\want\it"
fullname = os.path.join(target_dir,filename)
with open(fullname,"w") as f:
    f.write("<html>....</html>")
import webbrowser
url = "file://"+fullname.replace("\\","/")
webbrowser.open(url,True,True)
BTW: the code is the same in python 2.6.
I'll admit I don't know Python 3, so I may be wrong, but in Python 2, you can just check the __file__ variable in your module to get the name of the file it was loaded from. Just create your file in that same directory (preferably using os.path.dirname and os.path.join to remain platform-independent).
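A small sketch of that suggestion; __file__ is available in Python 3 modules as well. The helper name path_next_to and the output file name are hypothetical.

```python
import os


def path_next_to(module_file, filename):
    """Build a path for filename in the same directory as module_file."""
    directory = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(directory, filename)


# Typical use inside a script: pass the module's own __file__, e.g.
# output_path = path_next_to(__file__, "output.html")
```

Opening output_path for writing then creates the HTML file beside the script, regardless of the directory Python was started from.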