Example: https://apps1.lavote.net/camp/comm.cfm?&cid=82
With Selenium, I am clicking the first Form 497 link. In my browser, the PDF opens in a new tab; in Selenium, nothing seems to happen.
Here is my code, with some parts redacted:
def scrape(session_key=None):
options = Options()
options.headless = True
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.dir", os.path.join(base_dir, 'reports'))
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.helperApps.alwaysAsk.force", False);
profile.set_preference("browser.download.manager.showAlertOnComplete", False)
profile.set_preference("browser.download.manager.showWhenStarting", False);
profile.set_preference('browser.helperApps.neverAsk.saveToDisk','application/zip,application/octet-stream,application/x-zip-compressed,multipart/x-zip,application/x-rar-compressed, application/octet-stream,application/msword,application/vnd.ms-word.document.macroEnabled.12,application/vnd.openxmlformats-officedocument.wordprocessingml.document,application/vnd.ms-excel,application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,application/vnd.openxmlformats-officedocument.wordprocessingml.document,application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,application/rtf,application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,application/vnd.ms-excel,application/vnd.ms-word.document.macroEnabled.12,application/vnd.openxmlformats-officedocument.wordprocessingml.document,application/xls,application/msword,text/csv,application/vnd.ms-excel.sheet.binary.macroEnabled.12,text/plain,text/csv/xls/xlsb,application/csv,application/download,application/vnd.openxmlformats-officedocument.presentationml.presentation,application/octet-stream')
profile.set_preference("pdfjs.disabled", True)
profile.set_preference("plugin.disable_full_page_plugin_for_types", "application/pdf")
driver = webdriver.Firefox(firefox_profile=profile, options=options)
driver.get(magic_url)
committee_table = driver.find_elements_by_css_selector('table')[2]
links = [link.get_attribute('href') for link in committee_table.find_elements_by_tag_name('a')]
driver.get('https://apps1.lavote.net/camp/comm.cfm?&cid=82')
forms_table = driver.find_elements_by_css_selector('table')[1]
forms_table_trs = forms_table.find_elements_by_css_selector('tr')
for i, row in enumerate(forms_table_trs):
if i > 0:
cells = row.find_elements_by_css_selector('td')
print(1)
try:
link = cells[2].find_elements_by_tag_name('a')[0]
link.click()
pdfs = glob.glob(os.path.join(base_dir, 'scraper/*.pdf'))
latest_pdf_file = max(pdfs, key=os.path.getctime)
parse_funcs[form_type](latest_pdf_file)
except Exception as e:
print(e)
As you may have guessed, there are no pdfs. They are not downloaded. That's why I'm here. How can I do this?
If you only need the files and don't need to test the actual browser download dialogue, grab the files with plain Python instead of asking Selenium to do it.
Grab the PDF URLs from the page, then use requests to download each file into memory and write it to disk:
r = requests.get(url, allow_redirects=True)
with open(filename, 'wb') as f:
    f.write(r.content)
You can also derive the filename from the response, but it's a bit more involved. Check it here: https://www.codementor.io/#aviaryan/downloading-files-from-urls-in-python-77q3bs0un
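If the server sends a Content-Disposition header, the filename can be pulled from it; here is a minimal sketch under that assumption (the URL is a placeholder):

import os
import re
import requests

url = 'https://example.com/reports/some-report.pdf'  # placeholder URL
r = requests.get(url, allow_redirects=True)

# Prefer the filename advertised in the Content-Disposition header, if any
cd = r.headers.get('Content-Disposition', '')
match = re.search(r'filename="?([^";]+)"?', cd)
filename = match.group(1) if match else os.path.basename(url) or 'download.pdf'

with open(filename, 'wb') as f:
    f.write(r.content)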
Related
I would like to automate the following situation:
determine all links on a website
put them into one file
check if there are new links on the website (compare with the previous file)
if there are new links on the website, then put them in the file
Any ideas on how I could implement this? How should I save the links (as JSON, or just plain text)?
What I have so far:
links = driver.find_elements_by_css_selector("[href*='search']")
links2 = [elem.get_attribute('href') for elem in links]
print(links2)
Output:
['https://www.xyz/testing', 'https://www.xyz/testing2', 'https://www.xyz/testing3']
Here's a Python solution that adds new links to a file called links.txt every time it is run. It assumes that you have Selenium installed and chromedriver on your system PATH.
import codecs
import json
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
# Spin up a new Chrome browser
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)
# Open the URL and then get all links from the page
driver.get("https://seleniumbase.io/demo_page")
links = driver.find_elements(By.CSS_SELECTOR, "a[href]")
links = [elem.get_attribute('href') for elem in links]
links = list(set(links)) # Remove duplicates
# Get existing links from "links.txt", then combine all links
abs_path = os.path.abspath(".")
file_name = "links.txt"
file_path = os.path.join(abs_path, file_name)
file_links = []
if os.path.exists(file_path):
with open(file_path, "r") as f:
json_file_links = f.read().strip()
file_links = json.loads(json_file_links)
all_links = list(set(links + file_links))
# Print each link on a new line
for link in all_links:
print(link)
# Save all links in json format into "links.txt"
json_all_links = json.dumps(all_links)
links_file = codecs.open(file_path, "w+", encoding="utf-8")
links_file.writelines(json_all_links)
links_file.close()
driver.quit() # Close the browser
Here's the printed output of that for the URL provided. (The links will also be in a file called links.txt.)
https://github.com/seleniumbase/SeleniumBase
https://seleniumbase.com/
https://seleniumbase.io/
https://seleniumbase.io/demo_page/
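If you also want to report which links are new on this run, rather than just merging them into the file, a set difference against the previously saved list does it. A small sketch reusing the links and file_links variables from the script above:

# Links found this run that were not in links.txt before
new_links = sorted(set(links) - set(file_links))
if new_links:
    print("New links since the last run:")
    for link in new_links:
        print(link)
else:
    print("No new links since the last run.")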
I am currently trying to download a few PDF files from http://annualreports.com/Company/abercrombie-fitch and I am having trouble downloading the 2019 Annual Report. I am currently using:
response = urllib2.urlopen("http://annualreports.com" + link)
file = open(name, 'wb')
file.write(response.read())
where link is '/Click/20415', but this returns a text file rather than a PDF. Is there a specific way to fix this?
Another solution, using the requests module:
import requests
url = 'http://annualreports.com/Click/20415'
with requests.get(url, stream=True) as r:
filename = r.url.split('/')[-1]
with open(filename, 'wb') as f_out:
for chunk in r.iter_content(chunk_size=8192):
if chunk:
print('.', end='')
f_out.write(chunk)
This saves the NYSE_ANF_2019.pdf file to your disk.
If you use Selenium you could try this:
from selenium import webdriver
download_dir = "C:\\Temp\\Dowmload" # for linux/*nix, download_dir="/usr/Public"
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
"download.default_directory": download_dir , "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome('//Server/Apps/chrome_driver/chromedriver.exe', chrome_options=options) # Optional argument, if not specified will search path.
driver.get('http://annualreports.com' + link)
If you only want to download the PDF and do not want to do anything else on the site, I think it is better to use the method that @superstew describes. See:
https://stackabuse.com/download-files-with-python/
I need to download a set of individual PDF files from a webpage. They are made publicly available by the government (the Ministry of Education in Turkey), so this is totally legal.
However, my Selenium browser only displays the PDF file; how can I download it and name it as I wish?
(This code is also from the web.)
# Import your newly installed selenium package
from selenium import webdriver
from bs4 import BeautifulSoup
# Now create an 'instance' of your driver
# This path should be to wherever you downloaded the driver
driver = webdriver.Chrome(executable_path="/Users/ugur/Downloads/chromedriver")
# A new Chrome (or other browser) window should open up
download_dir = "/Users/ugur/Downloads/" # for linux/*nix, download_dir="/usr/Public"
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
"download.default_directory": download_dir , "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
# Now just tell it wherever you want it to go
driver.get("https://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid=5&ders=29")
driver.find_element_by_id("ContentPlaceHolder1_dtYillikPlanlar_lnkIndir_2").click()
driver.get("https://odsgm.meb.gov.tr/kurslar/PDFFile.aspx?name=kazanimtestleri.pdf")
Thanks in advance
Extra information:
I had Python 2 code that did this perfectly, but somehow it creates empty files and I couldn't convert it to Python 3. Maybe this helps (no offense, but I never liked Selenium):
import urllib
import urllib2
from bs4 import BeautifulSoup
import os
sinifId=5
maxOrd = 1
fileNames=[]
directory = '/Users/ugur/Downloads/Hasan'
print 'List of current files in directory '+ directory+'\n---------------------------------\n\n'
for current_file in os.listdir(directory):
if (current_file.find('pdf')>-1 and current_file.find(' ')>-1):
print current_file
order = int(current_file.split(' ',1)[0])
if order>maxOrd: maxOrd=order
fileNames.append(current_file.split(' ',2)[1])
print '\n\nStarting download \n---------------------------------\n'
ctA=int(maxOrd+1)
for ders in [29]:
urlSinif='http://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)
page = urllib2.urlopen(urlSinif)
soup = BeautifulSoup(page,"lxml")
st = soup.prettify()
count=st.count('ctl00')-1
dersAdi = soup.find('a', href='/kurslar/CevapAnahtarlari.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)).getText().strip()
for testNo in range(count):
if(str(sinifId)+str(ders)+str(testNo+1) in fileNames):
print str(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf'+' skipped'
else:
annex=""
if(testNo%2==1): annex="2"
eiha_url = u'http://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid='+str(sinifId)+'&ders='+str(ders)
data = ('__EVENTTARGET','ctl00$ContentPlaceHolder1$dtYillikPlanlar$ctl'+format(testNo, '02')+'$lnkIndir'+annex), ('__EVENTARGUMENT', '39')
print 'ctl00$ContentPlaceHolder1$dtYillikPlanlar$ctl'+format(testNo, '02')+'$lnkIndir'+annex
new_data = urllib.urlencode(data)
response = urllib2.urlopen(eiha_url, new_data)
urllib.urlretrieve (str(response.url), directory+'/{0:0>3}'.format(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf')
print str(ctA)+' '+str(sinifId)+str(ders)+str(testNo+1)+' '+dersAdi+str(testNo+1)+'.pdf'+' downloaded'
ctA=ctA+1
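For reference, here is a mechanical sketch of how the download step of the Python 2 snippet above might look in Python 3, with the urllib2/urllib calls swapped for their urllib.request/urllib.parse equivalents. It is untested against the site, and the event target string and output filename are only example values taken from the snippet:

from urllib.parse import urlencode
from urllib.request import urlopen, urlretrieve

eiha_url = 'http://odsgm.meb.gov.tr/kurslar/KazanimTestleri.aspx?sinifid=5&ders=29'
event_target = 'ctl00$ContentPlaceHolder1$dtYillikPlanlar$ctl00$lnkIndir'  # example target

# POST the form data, then fetch whatever URL the server redirects to
data = urlencode([('__EVENTTARGET', event_target),
                  ('__EVENTARGUMENT', '39')]).encode('utf-8')
response = urlopen(eiha_url, data)
urlretrieve(response.url, 'test.pdf')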
Add your options before launching Chrome and then specify the chrome_options parameter.
download_dir = "/Users/ugur/Downloads/"
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
"download.default_directory": download_dir,
"download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome(
executable_path="/Users/ugur/Downloads/chromedriver",
chrome_options=options
)
To answer your second question:
May I ask how to specify the filename as well?
I found this: Selenium give file name when downloading
What I do is:
file_name = ''
while not file_name.lower().endswith('.pdf'):
time.sleep(.25)
try:
file_name = max([download_dir + '/' + f for f in os.listdir(download_dir)], key=os.path.getctime)
except ValueError:
pass
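One caveat with this polling approach: Chrome writes an in-progress download as a temporary .crdownload file, so you may also want to wait until no such files remain before touching the result; a small sketch:

import glob
import os
import time

# Wait until Chrome has no partially downloaded files left in download_dir
while glob.glob(os.path.join(download_dir, '*.crdownload')):
    time.sleep(0.25)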
Here is a code sample I used to download a PDF with a specific file name. First you need to configure the Chrome webdriver with the required options. Then, after clicking the button (to open the PDF popup window), call a function that waits for the download to finish and renames the downloaded file.
import os
import time
import shutil
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
# function to wait for download to finish and then rename the latest downloaded file
def wait_for_download_and_rename(newFilename):
# function to wait for all chrome downloads to finish
def chrome_downloads(drv):
if not "chrome://downloads" in drv.current_url: # if 'chrome downloads' is not current tab
drv.execute_script("window.open('');") # open a new tab
drv.switch_to.window(driver.window_handles[1]) # switch to the new tab
drv.get("chrome://downloads/") # navigate to chrome downloads
return drv.execute_script("""
return document.querySelector('downloads-manager')
.shadowRoot.querySelector('#downloadsList')
.items.filter(e => e.state === 'COMPLETE')
.map(e => e.filePath || e.file_path || e.fileUrl || e.file_url);
""")
# wait for all the downloads to be completed
dld_file_paths = WebDriverWait(driver, 120, 1).until(chrome_downloads) # returns list of downloaded file paths
# Close the current tab (chrome downloads)
if "chrome://downloads" in driver.current_url:
driver.close()
# Switch back to original tab
driver.switch_to.window(driver.window_handles[0])
# get latest downloaded file name and path
dlFilename = dld_file_paths[0] # latest downloaded file from the list
# wait till downloaded file appears in download directory
time_to_wait = 20 # adjust timeout as per your needs
time_counter = 0
while not os.path.isfile(dlFilename):
time.sleep(1)
time_counter += 1
if time_counter > time_to_wait:
break
# rename the downloaded file
shutil.move(dlFilename, os.path.join(download_dir,newFilename))
return
# specify custom download directory
download_dir = r'c:\Downloads\pdf_reports'
# for configuring chrome pdf viewer for downloading pdf popup reports
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', {
"download.default_directory": download_dir, # Set own Download path
"download.prompt_for_download": False, # Do not ask for download at runtime
"download.directory_upgrade": True, # Also needed to suppress download prompt
"plugins.plugins_disabled": ["Chrome PDF Viewer"], # Disable this plugin
"plugins.always_open_pdf_externally": True, # Enable this plugin
})
# get webdriver with options for configuring chrome pdf viewer
driver = webdriver.Chrome(options = chrome_options)
# open desired webpage
driver.get('https://mywebsite.com/mywebpage')
# click the button to open pdf popup
driver.find_element_by_id('someid').click()
# call the function to wait for download to finish and rename the downloaded file
wait_for_download_and_rename('My file.pdf')
# close the browser windows
driver.quit()
Adjust the wait timeout (120) as per your needs.
Non-Selenium solution: you can do something like:
import requests
pdf_resp = requests.get("https://odsgm.meb.gov.tr/kurslar/PDFFile.aspx?name=kazanimtestleri.pdf")
with open("save.pdf", "wb") as f:
f.write(pdf_resp.content)
Although you might want to check the content type beforehand to make sure it's a PDF.
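For example, a quick check along these lines (a sketch; the exact header value can vary by server):

import requests

pdf_resp = requests.get("https://odsgm.meb.gov.tr/kurslar/PDFFile.aspx?name=kazanimtestleri.pdf")
content_type = pdf_resp.headers.get("Content-Type", "")
if "application/pdf" in content_type:
    with open("save.pdf", "wb") as f:
        f.write(pdf_resp.content)
else:
    print("Unexpected content type:", content_type)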
So the issue of downloading files via headless Chrome with Selenium still seems to be a problem; it was asked here with no answer over a month ago, but I don't understand how they are implementing the JS that is in the bug thread. Is there an option I can add, or a current fix for this? The original bug page is located here.
All of my stuff is up to date as of today, 10/22/17.
In Python:
from selenium import webdriver
options = webdriver.ChromeOptions()
prefs = {"download.default_directory": "C:/Stuff",
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True
}
options.add_experimental_option("prefs", prefs)
options.add_argument('headless')
driver = webdriver.Chrome(r'C:/Users/aaron/chromedriver.exe', chrome_options = options)
# test file to download which doesn't work
driver.get('http://ipv4.download.thinkbroadband.com/5MB.zip')
If the headless option is removed, this works with no problem.
The actual files I'm attempting to download are PDFs located at .aspx URLs. I'm downloading them by doing a .click(), and it works great, except not with the headless version. The hrefs are JavaScript do_postback scripts.
Why don't you locate the anchor's href and then use a GET request to download the file? That way it will work in headless mode and will be much faster. I have done that in C#.
import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
            # f.flush() commented out by recommendation from J.F. Sebastian
    return local_filename
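A possible way to wire that helper into the Selenium flow, sketched under the assumption that the anchors carry direct file hrefs rather than JavaScript postbacks (the selector is just an example, and driver is the existing webdriver instance):

# Collect each candidate link's href with Selenium, then download it with requests
anchors = driver.find_elements_by_css_selector("a[href$='.pdf']")  # example selector
for a in anchors:
    href = a.get_attribute('href')
    if href:
        download_file(href)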
I believe now that Chromium supports this feature (as you linked to the bug ticket), it falls to the chromedriver team to add support for the feature. There is an open ticket here, but it does not appear to have a high priority at the moment. Please, everyone who needs this feature, go give it a +1!
For those of you not on the Chromium ticket linked above, or who haven't found a solution: this is working for me. Chrome is updated to v65, and chromedriver/selenium are both up to date as of 4/16/18.
from selenium import webdriver

prefs = {'download.prompt_for_download': False,
         'download.directory_upgrade': True,
         'safebrowsing.enabled': False,
         'safebrowsing.disable_download_protection': True}
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome('chromedriver.exe', chrome_options=options)
driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
driver.desired_capabilities['browserName'] = 'ur mum'
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': r'C:\chickenbutt'}}
driver.execute("send_command", params)
If you're getting a "Failed - file path too long" error when downloading, make sure that the download path doesn't have a trailing space, slash, or backslash. The path must also use backslashes only. I have no idea why.
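On newer Selenium Python bindings for Chrome, the same CDP command can be sent without patching the command executor, via execute_cdp_cmd (a sketch; availability depends on your Selenium and chromedriver versions):

# Requires a Chromium-based driver and a Selenium version that exposes execute_cdp_cmd
driver.execute_cdp_cmd('Page.setDownloadBehavior', {
    'behavior': 'allow',
    'downloadPath': r'C:\chickenbutt',  # same download path as in the snippet above
})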
My code is as follows:
#!/usr/bin/env python
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
# Define firefox profile
download_dir = "/Users/pdubois/Desktop/TargetMine_Output/"
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir", download_dir)
#fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/plain")
driver = webdriver.Firefox(fp)
driver.implicitly_wait(20)
genes = "Eif2ak2,Pcgf5,Vps25,Prmt6,Tcerg1,Dnttip2,Preb,Polr1b,Gabpb1,Prdm1,Fosl2,Zfp143,Psip1,Kat6a,Tgif1,Txn1,Irf8,Cnot6l,Zfp451,Foxk2,Lpxn,Etv6,Khsrp,Lmo4,Nkrf,Mafk,Mbd1,Cited2,Elp5,Jdp2,Bzw1,Rbm15b,Klf9,Gtf2e2,Dynll1,Klf6,Stat1,Srrt,Gtf2f1,Adnp2,Ikbkg,Mybbp1a,Nup62,Brd2,Chd1,Kctd1,Sap30,Cebpd,Mtf1,Gtf2h2,Fubp1,Tcea1,Irf2bp2,Ezh2,Hnrpdl,Pml,Cebpz,Med7"
targetmine_url = "http://targetmine.nibio.go.jp/targetmine/begin.do"
driver.get(targetmine_url)
# Define type of list to be submitted
gene_select = Select(driver.find_element_by_name("type"))
gene_select.select_by_visible_text(u"Gene")
# Enter list and submit
gene_input = driver.find_element_by_id("listInput")
gene_input.send_keys(genes)
submit = driver.find_element_by_css_selector("input.button.light").click()
# Choose name for list
driver.find_element_by_id("newBagName").clear()
driver.find_element_by_id("newBagName").send_keys("ADX.06.ID.Clust1")
driver.switch_to_frame("__pomme::0")
# Add All
driver.find_element_by_css_selector("span.small.success.add-all.button").click()
# Save all genes
driver.find_element_by_css_selector("a.success.button.save").click()
# Select M. Musculus
driver.find_element_by_xpath("//ul[#id='customConverter']/li[2]/a[1]").click()
# Gene enrchment part
go_xpath = "//div[@id='gene_go_enrichment-widget']/div[@class='inner']/div[1]/div[@class='form']/form[@style='margin:0']/div[2]/select[1]"
#driver.find_element_by_xpath(go_xpath).click()
go_select = Select(driver.find_element_by_xpath(go_xpath))
go_select.select_by_visible_text(u"1.00")
# Download
#driver.find_element_by_css_selector("a.btn.btn-small.export").click()
This works fine. One last thing I want to achieve is to save the file automatically.
Although I've already set the Firefox profile at the top of the code, it doesn't do what I hoped. What's the right way to do it?
Update:
The solution by alecxe works. Except when I tried this, it doesn't save the file:
go_download_xpath = "//div[@id='gene_go_enrichment-widget']/div[@class='inner']/div[1]/div[2]/a[@class='btn btn-small export']"
driver.find_element_by_xpath(go_download_xpath).click()
# This saved the specific desired file.
# Using
# driver.find_element_by_css_selector("a.btn.btn-small.export").click()
# saved the wrong file.
This particular dialog cannot be controlled via Selenium: this is a browser popup, not a JavaScript popup (which can be automated with switch_to.alert).
In this case, you need to avoid the popup being shown in the first place and make Firefox download the file automatically by tweaking the browser's profile preferences. Firefox can download files automatically depending on the MIME type of the file being downloaded. In your case it is text/plain:
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.dir", download_dir)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/plain")
driver = webdriver.Firefox(firefox_profile=fp)
FYI, I downloaded the file manually and used the magic module to detect the MIME type:
In [1]: import magic
In [2]: mime = magic.Magic(mime=True)
In [3]: mime.from_file("result.tsv")
Out[3]: 'text/plain'
Try:
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.dir", download_dir)
fp.set_preference("browser.preferences.instantApply", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk",
"text/plain, application/octet-stream, application/binary, text/csv, application/csv, application/excel, text/comma-separated-values, text/xml, application/xml")
fp.set_preference("browser.helperApps.alwaysAsk.force", False)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.folderList", 2)
driver = webdriver.Firefox(firefox_profile=fp)