Download and check if file exists - python

I'm trying to write a script similar to one I saw in a video.
I have no idea how the author did it, but it works in every browser.
I'm not sure how to extract the filename from the URL and check whether the file was downloaded.
1 - open a text file containing a list with URLs (example.com/file.exe, anotherurl.com/file2.exe)
2 - for every url, open a browser tab, download the file and check if file exists
3 - Print "file downloaded" else "download failed" or calculate the fail ratio
My code works for a single URL when I know the filename in advance. I've been trying to make it work for a list of URLs. It should take the filename from the URL path, e.g. /file.exe.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import os.path
import requests

op = webdriver.ChromeOptions()
p = {'download.default_directory': 'C:\\Users\\VM\\Downloads'}
op.add_experimental_option('prefs', p)
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe",
                          options=op)
with open("C:\\Users\\VM\\Downloads\\urls.txt", 'r') as file:
    for url in file.readlines():
        driver.get(url)
        time.sleep(5)
        if os.path.isfile('CHECK URL PATH FILENAME'):
            print("File download is completed")
        else:
            print("File download is not completed")

One option is to empty the download folder (delete all files) before each download. After loading the URL, check whether at least one file is present; if so, check that its name ends with '.exe'.
# Clean the folder by deleting all files from the download location.
driver.get(url)
time.sleep(5)
# Get the list of files present in the folder and verify that a file name ends with .exe.
if os.path.isfile('CHECK URL PATH FILENAME'):
    print("File download is completed")
    # Write code to delete all files after this line.
else:
    print("File download is not completed")
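The part the question is missing, getting the filename out of each URL, could be sketched as follows. This assumes the browser saves each file under the last segment of the URL path; a redirect or a Content-Disposition header can change that name. The helper names filename_from_url, wait_for_file and fail_ratio are made up for illustration, and the polling loop stands in for the fixed time.sleep(5).

```python
import os
import time
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, e.g. 'file.exe'."""
    url = url.strip()  # lines read from a file keep their trailing newline
    # urlsplit only separates host and path when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    path = urlsplit(url).path
    return unquote(path.rsplit("/", 1)[-1])


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Poll until `path` exists or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path):
            return True
        time.sleep(poll)
    return os.path.isfile(path)


def fail_ratio(results):
    """Fraction of failed downloads in a list of booleans."""
    return results.count(False) / len(results) if results else 0.0
```

Inside the loop, the check would then be something like wait_for_file(os.path.join(download_dir, filename_from_url(url))), where download_dir is the folder set in 'download.default_directory'. Collecting the returned booleans in a list and passing it to fail_ratio gives the failure rate for step 3.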

Related

Trying to take a screenshot with selenium (while selecting the url from a text file)

I started using Selenium today, and I have a txt file with a lot of URLs.
The goal is to read one line from this file, open the browser, take a screenshot, and save it under the name of the URL. (I want to see what these links point to without opening them one by one; I know it's really lazy.)
The problem is that I don't know whether this works or what I'm missing (I'm sure I'm missing something).
from selenium import webdriver
from time import sleep

myfile = open("links.txt", "r")
url = myfile.readline()
while url:
    url = myfile.readline()
    driver = webdriver.Firefox()
    driver.get(url)
    sleep(1)
    driver.get_screenshot_as_file(url + ".png")
    driver.quit()
    print("printed!")
myfile.close()
This is the error I get:
selenium.common.exceptions.InvalidArgumentException: Message: Malformed URL: URL constructor: is not a valid URL.
Stacktrace:
RemoteError#chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError#chrome://remote/content/shared/webdriver/Errors.sys.mjs:182:5
InvalidArgumentError#chrome://remote/content/shared/webdriver/Errors.sys.mjs:311:5
GeckoDriver.prototype.navigateTo#chrome://remote/content/marionette/driver.sys.mjs:823:11
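The error is consistent with how the loop reads the file: the second readline() at the top of the loop throws away the URL that was just tested, and once the file is exhausted it returns an empty string, which is then passed to driver.get(). Each line also keeps its trailing newline, driver.get() expects a full URL including the scheme, and a URL contains characters such as ':' and '/' that cannot be used in a file name. A minimal sketch of one way around this, with read_urls, screenshot_name and capture_all as hypothetical helper names:

```python
import re


def read_urls(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, "r") as handle:
        return [line.strip() for line in handle if line.strip()]


def screenshot_name(url):
    """Turn a URL into a string that is safe to use as a file name."""
    # Drop the scheme, then replace every run of characters that is not a
    # letter, digit, dot or hyphen with a single underscore.
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    return re.sub(r"[^A-Za-z0-9.-]+", "_", without_scheme).strip("_") + ".png"


def capture_all(path):
    """Open each URL from the file in Firefox and save a screenshot."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        for url in read_urls(path):
            driver.get(url)
            driver.get_screenshot_as_file(screenshot_name(url))
    finally:
        driver.quit()
```

Calling capture_all("links.txt") would reuse one browser window for all URLs instead of starting a new Firefox per line, which is a design choice of this sketch rather than a requirement.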

Trouble renaming downloaded file from a folder

I've written a script in Python with Selenium to download a file from a webpage by clicking that file's link. When I run my script, the file seems to get downloaded into the predefined folder.
The problem is that I can't figure out how to rename the downloaded file. Note that there may be multiple files in that folder. I would like to rename the downloaded file to the value of the variable newname in the script.
How can I rename a downloaded file in a folder?
This is what I've written so far:
import os
from selenium import webdriver

url = "https://www.online-convert.com/file-format/docx"
folder_location = r"C:\Users\WCS\Desktop\file_storage"
newname = "document.docx"

def download_n_rename_file(link):
    driver.get(link)
    driver.find_element_by_css_selector("a[href$='example_multipage.docx']").click()
    #how to rename the downloaded file to "document.docx"
    #os.rename()

if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    prefs = {'download.default_directory': folder_location}
    chromeOptions.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    download_n_rename_file(url)
I am assuming the file you've downloaded is named example_multipage.docx:
import os
from selenium import webdriver

url = "https://www.online-convert.com/file-format/docx"
folder_location = r"C:\Users\WCS\Desktop\file_storage"
newname = "document.docx"

def download_n_rename_file(link):
    driver.get(link)
    driver.find_element_by_css_selector("a[href$='example_multipage.docx']").click()
    # Rename the downloaded file to "document.docx". Both paths are joined with
    # the download folder; bare names would resolve against the working directory.
    # The download must have finished before this line runs.
    os.rename(os.path.join(folder_location, 'example_multipage.docx'),
              os.path.join(folder_location, newname))

if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    prefs = {'download.default_directory': folder_location}
    chromeOptions.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    download_n_rename_file(url)
EDIT:
OP: but the problem is that the name is not known in advance.
This makes me think: what if we could detect when a file has downloaded successfully and then grab its name? But wait, that is not possible through Selenium.
Or could there be a way to detect the name of a downloaded file? Again, you have no control over how the downloaded file is named through Selenium.
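Although Selenium does not report the name, one workaround is to compare the folder's contents before and after the click. The sketch below assumes nothing else writes to the folder in the meantime; wait_for_new_file is a hypothetical helper, and the suffixes listed are the temporary-download extensions I am aware of for Chrome (.crdownload, .tmp) and Firefox (.part).

```python
import os
import time

# Names with these endings are treated as downloads still in progress.
TEMP_SUFFIXES = (".crdownload", ".tmp", ".part")


def wait_for_new_file(folder, before, timeout=30.0, poll=0.5):
    """Return the name of a finished file that appeared in `folder`.

    `before` is the set of names present before the download started.
    Returns None if nothing new shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        new = sorted(name for name in set(os.listdir(folder)) - before
                     if not name.endswith(TEMP_SUFFIXES))
        if new:
            return new[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

In the script above this would mean taking before = set(os.listdir(folder_location)) just ahead of the click, calling wait_for_new_file(folder_location, before) afterwards, and passing the returned name, joined with folder_location, to os.rename.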

Saving embedded .pdf not as .pdf file

I'm trying to download an embedded PDF from the Chrome browser with the code below; however, the file is stored on my C:\ drive as C:\TEST_A_15.pdf.crdownload.
def download_pdf(lnk):
    from selenium import webdriver
    from time import sleep
    options = webdriver.ChromeOptions()
    download_folder = "C:\\"
    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": ""}
    options.add_experimental_option("prefs", profile)
    print("Downloading file from link: {}".format(lnk))
    driver = webdriver.Chrome(chrome_options = options)
    driver.get(lnk)
    filename = lnk.split("=")[3]
    print("File: {}".format(filename))
    print("Status: Download Complete.")
    print("Folder: {}".format(download_folder))
    driver.close()
If I change the filename line to the one below, I get the desired C:\TEST_A_15.pdf file on my hard drive without the .crdownload suffix. But then I get an IndexError: list index out of range, which is logical because splitting the URL on "=" yields only four parts, so index 4 does not exist.
filename = lnk.split("=")[4]
The URL used (I changed the hostname and the name of the PDF file, so the URL doesn't work):
https://testing.nl/getpdf.asp?id=ORsP5UqX6IikuikcGiLD&unique=adda3b24-f9ca-4007-898a-caed5309c140&filename=TEST_A_15.pdf
Even stranger: when I use a network drive together with filename = lnk.split("=")[3], the file is stored as a .tmp file, e.g. 2498d715-84aa-4e81-8037-264bb0211b4b.tmp, and when I use the incorrect code (filename = lnk.split("=")[4]), it raises the IndexError but saves the file correctly as a .pdf file on the network drive.
I've solved it: the problem was that the webdriver closed before the entire PDF had been downloaded, which left .tmp or .crdownload files. So I added a sleep before closing the driver.
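This would also explain why the IndexError variant saved the file: the exception is raised before driver.close() is reached, so the browser stays open long enough to finish. A fixed sleep can still be too short for a large file, so a possible refinement is to poll the folder until the finished file is there. The sketch below is only an illustration: filename_from_query and wait_until_downloaded are hypothetical names, and it assumes a dedicated download folder, since unrelated .tmp files would keep it waiting.

```python
import os
import time
from urllib.parse import parse_qs, urlsplit

TEMP_SUFFIXES = (".crdownload", ".tmp")


def filename_from_query(link, key="filename"):
    """Return the value of the `filename` query parameter, or None."""
    values = parse_qs(urlsplit(link).query).get(key)
    return values[0] if values else None


def wait_until_downloaded(folder, filename, timeout=60.0, poll=0.5):
    """Wait until `filename` exists in `folder` and no temp files remain."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(folder)
        if filename in names and not any(n.endswith(TEMP_SUFFIXES) for n in names):
            return True
        time.sleep(poll)
    return False
```

Reading the name with parse_qs does not depend on how many "=" characters the URL contains, unlike lnk.split("=")[3]. Calling wait_until_downloaded(download_folder, filename) just before driver.close() would replace the fixed sleep.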

How to read a file downloaded by selenium webdriver in python

I am using Selenium with webdriver in Python to download a CSV file from a site. The file gets downloaded into the specified download directory. Here is an overview of my code:
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir",'xx/yy')
fp.set_preference('browser.helperApps.neverAsk.saveToDisk', "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
driver = webdriver.Firefox(fp)
driver.get('url')
I need to print the contents of this CSV to the terminal. Many similar files with random names will be downloaded into the same folder, so accessing the file by name won't work, as I don't know in advance what it will be.
You can get the last downloaded file from that location and then read the file:
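import os
path = "/path/to/folder"  # the download directory
names = os.listdir(path)
# Join each name with the folder so getmtime looks at the right file.
time_sorted_list = sorted(names, key=lambda name: os.path.getmtime(os.path.join(path, name)))
file_name = time_sorted_list[-1]
You can then read from this file. This assumes that parallel processes are not writing multiple files to the folder at the same time.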
EDIT:
I just saw the comment that multiple instances are downloading at once. In that case you can instead download the file directly from its URL with urllib:
from urllib.request import urlretrieve
# You can build a unique ID into the file name.
urlretrieve("http://www.example.com/yourfile.ext", "your-file-name.ext")
This answer combines previous Stack Overflow questions and answers with the comments on this post, so thank you, everyone.
I combined Selenium WebDriver and the Python requests module for this solution. I logged into the site using Selenium, copied the cookies from the webdriver session, and then used requests.get(url, cookies=webdriver_cookies) to get the file.
Here's the gist of my solution:
import requests
from selenium import webdriver

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir",'xx/yy')
fp.set_preference('browser.helperApps.neverAsk.saveToDisk', "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
driver = webdriver.Firefox(fp)
# selenium login code ...
# Copy the session cookies from the browser into a plain dict for requests.
driver_cookies = driver.get_cookies()
cookies_copy = {}
for driver_cookie in driver_cookies:
    cookies_copy[driver_cookie["name"]] = driver_cookie["value"]
r = requests.get('url', cookies=cookies_copy)
print(r.text)
I hope that this helps someone.
Downloading files with Selenium is never a good idea. You cannot control where or under which filename the file is saved, and finding out afterwards requires dirty hacks. The result depends on the browser, its settings, and whether the same file has been downloaded before.
Plus, you have to delete the file after the download, because otherwise numerous copies of the same file will pile up on your hard drive until it is completely full.
If possible, you should call something like
string downloadUrl = ButtonDownloadPdf.GetAttribute("href");
and then handle the downloading yourself, using conventional methods, not Selenium.
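The line above is C#; in Python the corresponding call is element.get_attribute("href"). A sketch of handling the download outside the browser follows. It uses urllib.request from the standard library rather than the requests module from the earlier answer, and cookie_header and download are hypothetical helpers. Sending the cookies as a single Cookie header assumes the site needs nothing beyond the session cookies.

```python
import shutil
import urllib.request


def cookie_header(driver_cookies):
    """Build a Cookie header value from Selenium's driver.get_cookies() list."""
    return "; ".join(f"{c['name']}={c['value']}" for c in driver_cookies)


def download(url, dest, driver_cookies=()):
    """Stream `url` to the file `dest`, sending the given browser cookies."""
    request = urllib.request.Request(url)
    if driver_cookies:
        request.add_header("Cookie", cookie_header(driver_cookies))
    # copyfileobj streams in chunks, so large files are not held in memory.
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

With a located link element, this could be called as download(link.get_attribute("href"), "report.pdf", driver.get_cookies()), where link and the file name are placeholders. The caller picks the destination path, so there is no need to guess what the browser named the file.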

Loop through downloading files using selenium in Python

This is a follow-up question to this previous question on how to download ~1000 files from Google Patents.
I would like to iterate through a list of filenames fname = ["ipg150106.zip", "ipg150113.zip"] and simulate clicking and saving these files to my computer. The following example works for me and downloads a single file:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
# Define parameters
savepath = 'D:\\' # set the desired path here for the files
# Download the files from Google Patents
profile = FirefoxProfile ()
profile.set_preference("browser.download.panel.shown", False)
profile.set_preference("browser.download.folderList", 2) # 2 means specify custom location
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", savepath) # choose folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",'application/octet-stream')
driver = webdriver.Firefox(firefox_profile=profile)
url = 'https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015'
driver.get(url)
filename = driver.find_element_by_xpath('//a[contains(text(), "ipg150106.zip")]')
filename.click()
I've tried to replace this with a list and a loop like this:
fname = ["ipg150106.zip", "ipg150113.zip"]
for f in fname:
    filename = driver.find_element_by_xpath('//a[contains(text(), f)]')
    filename.click()
    print('Finished loop for: {}.'.format(f))
However, the browser opens, but nothing happens (no clicking on files). Any ideas?
You need to pass the filename into the XPath expression:
filename = driver.find_element_by_xpath('//a[contains(text(), "{filename}")]'.format(filename=f))
Though, an easier location technique here would be "by partial link text":
for f in fname:
    filename = driver.find_element_by_partial_link_text(f)
    filename.click()
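The reason the original loop finds nothing is that the f inside the quoted string is part of the XPath text, where a bare f is read as a child element named f, not as the Python variable. The find_element_by_* helpers used in this thread were also removed in later Selenium 4 releases, as far as I know. A sketch of both fixes in the newer style, with link_xpath and click_all as hypothetical helper names:

```python
def link_xpath(filename):
    """Build an XPath that matches an <a> whose text contains `filename`."""
    # Assumes `filename` contains no double quote, which would end the literal.
    return '//a[contains(text(), "{}")]'.format(filename)


def click_all(driver, filenames):
    """Click the download link for each file name, Selenium 4 style."""
    # Imported here so link_xpath can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    for name in filenames:
        driver.find_element(By.PARTIAL_LINK_TEXT, name).click()
        print('Finished loop for: {}.'.format(name))
```

With the driver from the question, click_all(driver, fname) would replace the loop; driver.find_element(By.XPATH, link_xpath(name)) is the equivalent XPath form.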
