Loop through downloading files using selenium in Python - python

This is a follow-up question to this previous question on how to download ~1000 files from Google Patents.
I would like to iterate through a list of filenames fname = ["ipg150106.zip", "ipg150113.zip"] and simulate clicking and saving these files to my computer. The following example works for me and downloads a single file:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
# Define parameters
savepath = 'D:\\' # set the desired path here for the files
# Download the files from Google Patents
profile = FirefoxProfile ()
profile.set_preference("browser.download.panel.shown", False)
profile.set_preference("browser.download.folderList", 2) # 2 means specify custom location
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", savepath) # choose folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",'application/octet-stream')
driver = webdriver.Firefox(firefox_profile=profile)
url = 'https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015'
driver.get(url)
filename = driver.find_element_by_xpath('//a[contains(text(), "ipg150106.zip")]')
filename.click()
I've tried to replace this with a list and a loop like this:
fname = ["ipg150106.zip", "ipg150113.zip"]
for f in fname:
    filename = driver.find_element_by_xpath('//a[contains(text(), f)]')
    filename.click()
    print('Finished loop for: {}.'.format(f))
However, the browser opens, but nothing happens (no clicking on files). Any ideas?

You need to pass the filename into the XPath expression:
filename = driver.find_element_by_xpath('//a[contains(text(), "{filename}")]'.format(filename=f))
Though, an easier location technique here would be "by partial link text":
for f in fname:
    filename = driver.find_element_by_partial_link_text(f)
    filename.click()
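To make the difference concrete, here is a minimal sketch that needs no browser. It compares the string the original loop sent with the strings produced once the filename is interpolated. The helper name xpath_for_link is hypothetical and only used for illustration.

```python
def xpath_for_link(name):
    """Build an XPath that matches an <a> whose text contains `name`."""
    return '//a[contains(text(), "{}")]'.format(name)

fname = ["ipg150106.zip", "ipg150113.zip"]

# What the original loop passed on every iteration: the letter f is just
# part of the XPath string here, not the Python loop variable.
literal = '//a[contains(text(), f)]'

# Interpolating the name yields a different expression for each file.
xpaths = [xpath_for_link(f) for f in fname]
for xp in xpaths:
    print(xp)
```

Each resulting string can then be passed to the XPath lookup inside the loop in place of the literal one.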

Related

Download and check if file exists

I'm trying to write a script similar to one I saw in a video.
I have no idea how its author did it, but it worked in every browser.
I'm not sure how to extract the filename from the URL and check whether that file exists.
1 - open a text file containing a list with URLs (example.com/file.exe, anotherurl.com/file2.exe)
2 - for every url, open a browser tab, download the file and check if file exists
3 - Print "file downloaded" else "download failed" or calculate the fail ratio
My code works for a single URL when I know the filename in advance. I've been trying to make it work for a list of URLs. It should take the filename from the URL path (/file.exe).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import os.path
import requests
op = webdriver.ChromeOptions()
p = {'download.default_directory': 'C:\\Users\\VM\\Downloads'}
op.add_experimental_option('prefs', p)
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe",
                          options=op)
with open("C:\\Users\\VM\\Downloads\\urls.txt",'r') as file:
    for url in file.readlines():
        driver.get(url)
        time.sleep(5)
        if os.path.isfile('CHECK URL PATH FILENAME'):
            print("File download is completed")
        else:
            print("File download is not completed")
One approach is to empty the download folder (delete all files) before each download. After requesting the URL, check whether at least one file is present and, if so, whether its name ends with '.exe'.
# Clean the folder first by deleting all files from the download location
driver.get(url)
time.sleep(5)
# List the files now present in the folder and look for one ending in .exe
if any(name.endswith('.exe') for name in os.listdir(p['download.default_directory'])):
    print("File download is completed")
    # Delete all files again here, before the next download
else:
    print("File download is not completed")
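The question also asks how to take the filename from the URL path. A minimal sketch using only the standard library follows; the helper names filename_from_url, expected_download_path and fail_ratio are hypothetical, and it assumes the browser saves the file under the last segment of the URL path, which is not guaranteed (a server can suggest a different name).

```python
import os
import posixpath
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path segment of a URL, e.g. 'file.exe'."""
    # strip() removes the trailing newline that readlines() leaves on each line
    return posixpath.basename(urlparse(url.strip()).path)

def expected_download_path(url, download_dir):
    """Path where the file is expected to appear after the download."""
    return os.path.join(download_dir, filename_from_url(url))

def fail_ratio(results):
    """Fraction of failed downloads in a list of True/False outcomes."""
    return results.count(False) / len(results) if results else 0.0

print(filename_from_url("https://example.com/file.exe\n"))
print(filename_from_url("anotherurl.com/file2.exe"))
print(fail_ratio([True, False, False, True]))
```

Inside the loop, os.path.isfile(expected_download_path(url, download_dir)) would then replace the placeholder check, with one True/False outcome collected per URL.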

Python Selenium check for new links and put in text file if not exist

I would like to automate the following situation:
determine all links on a website
put them into one file
check if there are new links on the website (compare with the previous file)
if there are new links on the website, then put them in the file
Any ideas on how I could implement this? How should I save the links (as JSON, or as plain text)?
What I have so far:
links = driver.find_elements_by_css_selector("[href*='search']")
links2 = [elem.get_attribute('href') for elem in links]
print(links2)
Output:
['https://www.xyz/testing', 'https://www.xyz/testing2', 'https://www.xyz/testing3']
Here's a Python solution that adds new links to a file called links.txt every time it is run. This assumes that you have selenium installed, and that you have chromedriver on your System PATH.
import codecs
import json
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
# Spin up a new Chrome browser
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)
# Open the URL and then get all links from the page
driver.get("https://seleniumbase.io/demo_page")
links = driver.find_elements(By.CSS_SELECTOR, "a[href]")
links = [elem.get_attribute('href') for elem in links]
links = list(set(links)) # Remove duplicates
# Get existing links from "links.txt", then combine all links
abs_path = os.path.abspath(".")
file_name = "links.txt"
file_path = os.path.join(abs_path, file_name)
file_links = []
if os.path.exists(file_path):
    with open(file_path, "r") as f:
        json_file_links = f.read().strip()
    file_links = json.loads(json_file_links)
all_links = list(set(links + file_links))
# Print each link on a new line
for link in all_links:
    print(link)
# Save all links in json format into "links.txt"
json_all_links = json.dumps(all_links)
links_file = codecs.open(file_path, "w+", encoding="utf-8")
links_file.writelines(json_all_links)
links_file.close()
driver.quit() # Close the browser
Here's the printed output of that for the URL provided. (The links will also be in a file called links.txt.)
https://github.com/seleniumbase/SeleniumBase
https://seleniumbase.com/
https://seleniumbase.io/
https://seleniumbase.io/demo_page/
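The question also asks which links are new compared with the previous run. A minimal sketch of that comparison, separated from the browser part, could look like the following. The function name update_links_file and the demo links are assumptions for illustration; the file is stored as a JSON list, as in the answer above.

```python
import json
import os
import tempfile

def update_links_file(file_path, current_links):
    """Merge current_links into the JSON list stored at file_path.

    Returns the links that were not in the file before this call.
    """
    stored = []
    if os.path.exists(file_path):
        with open(file_path, "r", encoding="utf-8") as f:
            stored = json.load(f)
    new_links = sorted(set(current_links) - set(stored))
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(stored + new_links, f, indent=2)
    return new_links

# Demonstration with a temporary file and made-up links
demo_path = os.path.join(tempfile.mkdtemp(), "links.txt")
first_run = update_links_file(
    demo_path, ["https://www.xyz/testing", "https://www.xyz/testing2"])
second_run = update_links_file(
    demo_path, ["https://www.xyz/testing2", "https://www.xyz/testing3"])
print(first_run)
print(second_run)
```

In a real run, the list returned by the Selenium lookup would be passed as current_links, and the return value tells you what appeared since the last run.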

Trouble renaming downloaded file from a folder

I've written a script in Python, in combination with Selenium, to download a file from a webpage by clicking that file's link. When I run my script, the file is downloaded into the predefined folder.
The problem is that I can't find a way to rename the downloaded file. Note that there may be multiple files in that folder. I would like to rename the downloaded file to the value of the variable newname in the script.
How can I rename a downloaded file in a folder?
This is what I've written so far:
import os
from selenium import webdriver
url = "https://www.online-convert.com/file-format/docx"
folder_location = r"C:\Users\WCS\Desktop\file_storage"
newname = "document.docx"
def download_n_rename_file(link):
    driver.get(link)
    driver.find_element_by_css_selector("a[href$='example_multipage.docx']").click()
    #how to rename the downloaded file to "document.docx"
    #os.rename()
if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    prefs = {'download.default_directory': folder_location}
    chromeOptions.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    download_n_rename_file(url)
I am assuming the downloaded file is named example_multipage.docx:
import os
from selenium import webdriver
url = "https://www.online-convert.com/file-format/docx"
folder_location = r"C:\Users\WCS\Desktop\file_storage"
newname = "document.docx"
def download_n_rename_file(link):
    driver.get(link)
    driver.find_element_by_css_selector("a[href$='example_multipage.docx']").click()
    # Rename the downloaded file to "document.docx". Both paths must point
    # into the download folder, and the download must have finished first.
    os.rename(os.path.join(folder_location, 'example_multipage.docx'),
              os.path.join(folder_location, newname))
if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    prefs = {'download.default_directory': folder_location}
    chromeOptions.add_experimental_option('prefs', prefs)
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    download_n_rename_file(url)
EDIT:
OP: but the problem is there is no such existing name in advance.
That raises the question of whether we could detect when a file has finished downloading and then read its name. Selenium itself cannot do that: it does not report when a download completes.
Nor does it offer a way to look up the name of a downloaded file, and you have no control over how the download is named through Selenium.
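Selenium aside, a common workaround is to watch the download folder from Python: take a snapshot of its contents before the click and look for a name that was not there before. The sketch below is an illustration under stated assumptions, not a definitive implementation. The helper wait_for_new_file is hypothetical, a temporary folder stands in for the download directory, and it assumes that unfinished downloads carry a .crdownload, .tmp or .part suffix.

```python
import os
import tempfile
import time

def wait_for_new_file(folder, before, timeout=30.0, poll=0.25):
    """Return the name of a finished file that appeared in `folder` after the
    `before` snapshot was taken, or None if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new = [n for n in set(os.listdir(folder)) - before
               if not n.endswith((".crdownload", ".tmp", ".part"))]
        if new:
            return new[0]
        time.sleep(poll)
    return None

# Demonstration: a temporary folder stands in for the download directory
folder = tempfile.mkdtemp()
open(os.path.join(folder, "older.docx"), "w").close()
before = set(os.listdir(folder))  # snapshot taken before the click
open(os.path.join(folder, "example_multipage.docx"), "w").close()  # the "download"
found = wait_for_new_file(folder, before, timeout=2)
os.rename(os.path.join(folder, found), os.path.join(folder, "document.docx"))
print(sorted(os.listdir(folder)))
```

Because only names absent from the snapshot are considered, other files already in the folder are left alone.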

Selenium Python Download popup pdf with specific filename

I need to download a set of individual PDF files from a webpage. They are made publicly available by the government (the Ministry of Education in Turkey), so downloading them is legal.
However, my Selenium browser only displays the PDF file. How can I download it and save it under a name of my choice?
(This code is also from the web.)
# Import your newly installed selenium package
from selenium import webdriver
from bs4 import BeautifulSoup
# Now create an 'instance' of your driver
# This path should be to wherever you downloaded the driver
driver = webdriver.Chrome(executable_path="/Users/ugur/Downloads/chromedriver")
# A new Chrome (or other browser) window should open up
download_dir = "/Users/ugur/Downloads/" # for linux/*nix, download_dir="/usr/Public"
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
"download.default_directory": download_dir , "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
# Now just tell it wherever you want it to go
driver.get("https://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid=5&ders=29")
driver.find_element_by_id("ContentPlaceHolder1_dtYillikPlanlar_lnkIndir_2").click()
driver.get("https://odsgm.meb.gov.tr/kurslar/PDFFile.aspx?name=kazanimtestleri.pdf")
Thanks in advance
Extra information:
I had Python 2 code that used to do this perfectly, but it now somehow creates empty files, and I couldn't convert it to Python 3. Maybe this helps (no offense, but I never liked Selenium):
import urllib
import urllib2
from bs4 import BeautifulSoup
import os
sinifId=5
maxOrd = 1
fileNames=[]
directory = '/Users/ugur/Downloads/Hasan'
print 'List of current files in directory '+ directory+'\n---------------------------------\n\n'
for current_file in os.listdir(directory):
    if (current_file.find('pdf')>-1 and current_file.find(' ')>-1):
        print current_file
        order = int(current_file.split(' ',1)[0])
        if order>maxOrd: maxOrd=order
        fileNames.append(current_file.split(' ',2)[1])
print '\n\nStarting download \n---------------------------------\n'
ctA=int(maxOrd+1)
for ders in [29]:
    urlSinif='http://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)
    page = urllib2.urlopen(urlSinif)
    soup = BeautifulSoup(page,"lxml")
    st = soup.prettify()
    count=st.count('ctl00')-1
    dersAdi = soup.find('a', href='/kurslar/CevapAnahtarlari.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)).getText().strip()
    for testNo in range(count):
        if(str(sinifId)+str(ders)+str(testNo+1) in fileNames):
            print str(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf'+' skipped'
        else:
            annex=""
            if(testNo%2==1): annex="2"
            eiha_url = u'http://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)
            data = ('__EVENTTARGET','ctl00$ContentPlaceHolder1$dtYillikPlanlar$ctl'+format(testNo, '02')+'$lnkIndir'+annex), ('__EVENTARGUMENT', '39')
            print 'ctl00$ContentPlaceHolder1$dtYillikPlanlar$ctl'+format(testNo, '02')+'$lnkIndir'+annex
            new_data = urllib.urlencode(data)
            response = urllib2.urlopen(eiha_url, new_data)
            urllib.urlretrieve (str(response.url), directory+'/{0:0>3}'.format(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf')
            print str(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf'+' downloaded'
        ctA=ctA+1
Add your options before launching Chrome and then specify the chrome_options parameter.
download_dir = "/Users/ugur/Downloads/"
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
           "download.default_directory": download_dir,
           "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome(
    executable_path="/Users/ugur/Downloads/chromedriver",
    chrome_options=options
)
To answer your second question:
May I ask how to specify the filename as well?
I found this: Selenium give file name when downloading
What I do is:
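import os
import time
# Poll until the most recently created file in download_dir is a .pdf
file_name = ''
while not file_name.lower().endswith('.pdf'):
    time.sleep(.25)
    try:
        file_name = max([os.path.join(download_dir, f) for f in os.listdir(download_dir)], key=os.path.getctime)
    except ValueError:
        pass  # download_dir is still empty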
Here is the code sample I used to download pdf with a specific file name. First you need to configure chrome webdriver with required options. Then after clicking the button (to open pdf popup window), call a function to wait for download to finish and rename the downloaded file.
import os
import time
import shutil
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
# function to wait for download to finish and then rename the latest downloaded file
def wait_for_download_and_rename(newFilename):
    # function to wait for all chrome downloads to finish
    def chrome_downloads(drv):
        if not "chrome://downloads" in drv.current_url: # if 'chrome downloads' is not current tab
            drv.execute_script("window.open('');") # open a new tab
            drv.switch_to.window(driver.window_handles[1]) # switch to the new tab
            drv.get("chrome://downloads/") # navigate to chrome downloads
        return drv.execute_script("""
            return document.querySelector('downloads-manager')
            .shadowRoot.querySelector('#downloadsList')
            .items.filter(e => e.state === 'COMPLETE')
            .map(e => e.filePath || e.file_path || e.fileUrl || e.file_url);
            """)
    # wait for all the downloads to be completed
    dld_file_paths = WebDriverWait(driver, 120, 1).until(chrome_downloads) # returns list of downloaded file paths
    # Close the current tab (chrome downloads)
    if "chrome://downloads" in driver.current_url:
        driver.close()
    # Switch back to original tab
    driver.switch_to.window(driver.window_handles[0])
    # get latest downloaded file name and path
    dlFilename = dld_file_paths[0] # latest downloaded file from the list
    # wait till downloaded file appears in download directory
    time_to_wait = 20 # adjust timeout as per your needs
    time_counter = 0
    while not os.path.isfile(dlFilename):
        time.sleep(1)
        time_counter += 1
        if time_counter > time_to_wait:
            break
    # rename the downloaded file
    shutil.move(dlFilename, os.path.join(download_dir,newFilename))
    return
# specify custom download directory
download_dir = r'c:\Downloads\pdf_reports'
# for configuring chrome pdf viewer for downloading pdf popup reports
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', {
    "download.default_directory": download_dir, # Set own Download path
    "download.prompt_for_download": False, # Do not ask for download at runtime
    "download.directory_upgrade": True, # Also needed to suppress download prompt
    "plugins.plugins_disabled": ["Chrome PDF Viewer"], # Disable this plugin
    "plugins.always_open_pdf_externally": True, # Download PDFs instead of opening them in the viewer
})
# get webdriver with options for configuring chrome pdf viewer
driver = webdriver.Chrome(options = chrome_options)
# open desired webpage
driver.get('https://mywebsite.com/mywebpage')
# click the button to open pdf popup
driver.find_element_by_id('someid').click()
# call the function to wait for download to finish and rename the downloaded file
wait_for_download_and_rename('My file.pdf')
# close the browser windows
driver.quit()
Adjust the WebDriverWait timeout (120 seconds here) to suit your needs.
A non-Selenium solution: you can do something like this:
import requests
pdf_resp = requests.get("https://odsgm.meb.gov.tr/kurslar/PDFFile.aspx?name=kazanimtestleri.pdf")
with open("save.pdf", "wb") as f:
    f.write(pdf_resp.content)
You might want to check the content type first, though, to make sure it's a PDF.
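One way such a check could look is sketched below. The helper looks_like_pdf is hypothetical. It takes the response headers and the raw bytes, so it can be tried without a network call, and it deliberately applies a strict rule: the Content-Type must be application/pdf and the body must start with the PDF signature %PDF-. A server that labels PDFs differently would need a looser rule.

```python
def looks_like_pdf(headers, body):
    """Heuristic check that an HTTP response carries a PDF.

    headers: mapping of response headers (for example resp.headers)
    body:    raw response bytes (for example resp.content)
    """
    # Drop parameters such as "; charset=..." before comparing
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    header_ok = content_type == "application/pdf"
    magic_ok = body.startswith(b"%PDF-")  # every PDF file begins with this marker
    return header_ok and magic_ok

print(looks_like_pdf({"Content-Type": "application/pdf"}, b"%PDF-1.7\n"))
print(looks_like_pdf({"Content-Type": "text/html; charset=utf-8"}, b"<html>"))
print(looks_like_pdf({"Content-Type": "application/pdf"}, b""))
```

With requests, the call would be looks_like_pdf(pdf_resp.headers, pdf_resp.content) before writing the file. The last case also catches an empty body, which is the symptom the question describes.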

Saving embedded .pdf not as .pdf file

I'm trying to download an embedded PDF from the Chrome browser with the code below. However, the file is being stored on my C:\ drive as C:\TEST_A_15.pdf.crdownload.
def download_pdf(lnk):
    from selenium import webdriver
    from time import sleep
    options = webdriver.ChromeOptions()
    download_folder = "C:\\"
    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": ""}
    options.add_experimental_option("prefs", profile)
    print("Downloading file from link: {}".format(lnk))
    driver = webdriver.Chrome(chrome_options = options)
    driver.get(lnk)
    filename = lnk.split("=")[3]
    print("File: {}".format(filename))
    print("Status: Download Complete.")
    print("Folder: {}".format(download_folder))
    driver.close()
If I change the filename line to the one below, I get the desired C:\TEST_A_15.pdf file on my hard drive, without the .crdownload suffix. But I then get an IndexError: list index out of range, which is logical because splitting the URL on "=" yields only four parts, so index 4 does not exist.
filename = lnk.split("=")[4]
The URL used (I changed the hostname and the name of the PDF file, so the URL doesn't work):
https://testing.nl/getpdf.asp?id=ORsP5UqX6IikuikcGiLD&unique=adda3b24-f9ca-4007-898a-caed5309c140&filename=TEST_A_15.pdf
Even stranger: when I use a network drive together with filename = lnk.split("=")[3], the file is stored as a .tmp file, e.g. 2498d715-84aa-4e81-8037-264bb0211b4b.tmp, and when I use the incorrect code (filename = lnk.split("=")[4]), it raises the IndexError but saves the file correctly as a .pdf file on the network drive.
I've solved it: the problem was that the webdriver closed before the entire PDF had been downloaded, which left .tmp or .crdownload files behind. So I added a sleep before closing the driver.
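A fixed sleep works, but it either wastes time or is too short for large files. As an alternative, here is a minimal sketch that polls the download folder until no partial files remain. The function name wait_until_downloads_finish is hypothetical, and the suffix list assumes the .crdownload and .tmp names mentioned above. Note that it returns True immediately if the download has not started yet, so a short pause after driver.get() may still be needed.

```python
import os
import tempfile
import time

PARTIAL_SUFFIXES = (".crdownload", ".tmp")

def wait_until_downloads_finish(folder, timeout=60.0, poll=0.5):
    """Return True once `folder` holds no partial-download files,
    or False if some are still present when the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        pending = [n for n in os.listdir(folder) if n.endswith(PARTIAL_SUFFIXES)]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)

# Demonstration with a temporary folder instead of a real browser download
folder = tempfile.mkdtemp()
partial = os.path.join(folder, "TEST_A_15.pdf.crdownload")
open(partial, "w").close()
still_pending = wait_until_downloads_finish(folder, timeout=1, poll=0.1)
os.rename(partial, os.path.join(folder, "TEST_A_15.pdf"))  # simulates completion
finished = wait_until_downloads_finish(folder, timeout=1, poll=0.1)
print(still_pending, finished)
```

In the original function, the call would go just before driver.close(), with folder set to download_folder.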
