I want to save the files I get with a scraper into custom folders. I looked around and none of the solutions I found worked for me. Here is my configuration:
options = webdriver.ChromeOptions()
prefs = {
'profile.default_content_settings.popups': 0,
'download.default_directory': my_data_folder,
"download.directory_upgrade": True,
"download.prompt_for_download": False,
"safebrowsing.enabled":False,
}
options.add_argument('--remote-debugging-port=9222')
options.add_experimental_option("useAutomationExtension", False)
desired_caps = {
'prefs': {
'savefile': {
'default_directory': my_data_folder,
"directory_upgrade": True,
"extensions_to_open": ""
}
}
}
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=options, desired_capabilities=desired_caps)
But when I try downloading, it goes to ~/Downloads/ instead of my_data_folder.
I have tried prefs and desired_caps independently to no avail.
I am using Chromium 108.0.5359.22 snap
Help is appreciated !
I have tried:
How to download to a specific folder with Chromedriver?
Define download directory for chromedriver selenium with python
and many other posts and blogs. All these solutions are summarised in the script above.
Thanks !
UPDATE
It works if I add
options.add_argument("--headless")
The folder option works, but this is not desirable for other reasons. Is there a better way to fix this problem?
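Have you thought about using shutil to move the file after the download?
Here's how I implemented that in another project I was working on:
import os
import shutil
# Pick the most recently created file in the download folder
filename = max(
    [os.path.join(download_folder, f) for f in os.listdir(download_folder)],
    key=os.path.getctime)
# Move (or rename) it; "filename.format" is a placeholder for the target name
shutil.move(
    filename,
    os.path.join(download_folder, "filename.format")
)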
Related
I aim to download web files while in headless mode. My program downloads perfectly when NOT in headless mode, but once I add the constraint not to show MS Edge opening, the downloading is disregarded.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
driver = webdriver.Edge()
driver.get("URL")
id_box = driver.find_element(By.ID,"...")
pw_box = driver.find_element(By.ID,"...")
id_box.send_keys("...")
pw_box.send_keys("...")
log_in = driver.find_element(By.ID,"...")
log_in.click()
time.sleep(0.1) # If not included, get error: "Unable to locate element"
drop_period = Select(driver.find_element(By.ID,"..."))
drop_period.select_by_index(1)
drop_consul = Select(driver.find_element(By.ID,"..."))
drop_consul.select_by_visible_text("...")
drop_client = Select(driver.find_element(By.ID,"..."))
drop_client.select_by_index(1)
# The following files do not download when headless is included:
driver.find_element(By.XPATH, "...").click()
driver.find_element(By.XPATH, "...").click()
In that case, you might try downloading the file with Python requests, using the direct link to the file.
You'll need to get the URL by reading the element's href attribute.
Downloading and saving a file from the URL should then work as follows:
import requests as req
remote_url = 'http://www.example.com/file.txt'
local_file_name = 'my_file.txt'
data = req.get(remote_url)
# Save file data to local copy
with open(local_file_name, 'wb') as file:
    file.write(data.content)
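If the file sits behind a login, a plain HTTP request will not carry the browser's session. A minimal sketch of one way around that, assuming the link is an ordinary anchor with a real href: read the href from the element and forward the cookies Selenium already holds. The function names here are hypothetical, and the sketch uses the standard library's urllib instead of requests only to stay dependency-free.

```python
import urllib.request


def cookie_header(selenium_cookies):
    """Build a Cookie header value from the dicts returned by driver.get_cookies()."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def download_with_browser_cookies(driver, element, local_file_name):
    """Download the file an anchor element points to, reusing the browser session.

    `driver` is a Selenium WebDriver and `element` a located <a> element;
    both are supplied by the caller.
    """
    url = element.get_attribute("href")
    request = urllib.request.Request(
        url, headers={"Cookie": cookie_header(driver.get_cookies())}
    )
    # Save file data to a local copy
    with urllib.request.urlopen(request) as response, open(local_file_name, "wb") as out:
        out.write(response.read())
    return local_file_name
```

This will not help when the href is a javascript: postback rather than a direct link, since there is no URL to fetch.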
There are different headless modes for Chrome. If you want to download files, use one of the special ones.
For Chrome 109 and above, use:
options.add_argument("--headless=new")
For Chrome 108 and below, use:
options.add_argument("--headless=chrome")
Reference: https://github.com/chromium/chromium/commit/e9c516118e2e1923757ecb13e6d9fff36775d1f4
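If the same script has to run against several browser versions, the choice between the two flags can be made in code. This is only a sketch that follows the version thresholds given in this answer; the helper names are made up for illustration, and the version string is assumed to look like the one in the first question.

```python
def major_version(version_string):
    """Extract the major version number from a string such as '108.0.5359.22'."""
    return int(version_string.split(".")[0])


def headless_argument(chrome_major_version):
    """Return the headless flag that, per this answer, still allows downloads."""
    if chrome_major_version >= 109:
        return "--headless=new"
    return "--headless=chrome"


# Example with the Chromium version mentioned in the first question
print(headless_argument(major_version("108.0.5359.22")))  # --headless=chrome
```

The returned string can then be passed to options.add_argument(...) as in the lines above.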
Downloading files in headless mode works for me on MicrosoftEdge version 110.0.1587.41 using following options:
MicrosoftEdge: [{
"browserName": "MicrosoftEdge",
"ms:edgeOptions": {
args: ['--headless=new'],
prefs: {
"download.prompt_for_download": false,
"plugins.always_open_pdf_externally": true,
'download.default_directory': "dlFolder"
}
},
}]
Nothing worked until I added the option '--headless=new'
N.B: Tested on a Mac environment using webdriverIO
I'm trying to save a web page as a PDF but all I get is a file name selection window. How to automatically enter a file name and save it?
settings = {
"appState": {
"recentDestinations": [{
"id": "Save as PDF",
"origin": "local",
"account": "",
"margin": 0,
'size': 'auto'
}],
"selectedDestinationId": "Save as PDF",
"version": 2,
"margin": 0,
'size': 'auto'
}
}
#There is probably a lot of excess here, I tried to use everything that can help
prefs = {'printing.print_preview_sticky_settings': json.dumps(settings),
'profile.default_content_settings.popups': 0,
'download.name': 'test.pdf', #It doesn't work(
'download.default_directory': download_path,
'savefile.default_directory': download_path,
'download.prompt_for_download': False,
"download.directory_upgrade": True,
"safebrowsing_for_trusted_sources_enabled": False,
"safebrowsing.enabled": True,
"download.extensions_to_open": "",
"plugins.always_open_pdf_externally": True,
}
options.add_experimental_option('prefs', prefs)
options.add_argument('--kiosk-printing')
driver = webdriver.Chrome(service=ser, options=options)
driver.maximize_window()
driver.get('url')
driver.execute_script('window.print();')
time.sleep(20)
I couldn't find a solution on the internet, I tried every possible option but it doesn't work for me.
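Calling window.print() only opens the browser's print dialog, so it cannot name and save the file on its own. Selenium 4 does offer driver.print_page(), which returns the page as a base64-encoded PDF; alternatively, you can use a third-party tool such as wkhtmltopdf to accomplish this.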
Install wkhtmltopdf
Download the wkhtmltopdf binaries from the official website and install them on your system.
Add wkhtmltopdf to your PATH
Add the wkhtmltopdf binary to your system PATH so that your script can find it.
Use the save_as_pdf function
The save_as_pdf function takes a Selenium webdriver instance and a filename as arguments and saves the current page as a PDF.
import subprocess

def save_as_pdf(driver, filename):
    # wkhtmltopdf fetches the URL itself, so it does not share the browser's login session or cookies
    subprocess.run(['wkhtmltopdf', driver.current_url, filename], check=True)
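For comparison, here is a minimal sketch that needs no external tool, assuming Selenium 4 or later, where the driver exposes print_page(). That method returns the PDF as a base64 string, so the file name and location are entirely under your control and no save dialog appears. The function name is hypothetical.

```python
import base64


def save_page_as_pdf(driver, filename):
    """Write the current page to `filename` as a PDF.

    Assumes `driver` is a Selenium 4 WebDriver, whose print_page()
    returns the PDF document as a base64-encoded string.
    """
    pdf_base64 = driver.print_page()
    with open(filename, "wb") as pdf_file:
        pdf_file.write(base64.b64decode(pdf_base64))
    return filename
```

After driver.get('url'), calling save_page_as_pdf(driver, 'test.pdf') would replace the window.print() call and the sleep.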
I was able to solve this problem using the pyautogui library. Although I think that this is not the best solution
import pyautogui as pag
driver.execute_script('window.print();')
time.sleep(20)
pag.typewrite('test.pdf')
time.sleep(1)
pag.press("enter")
time.sleep(20)
After updating Selenium and Visual Studio, I have the following problem. I try to open a URL, for example
thestore = "http://shop.oki.gr/shop/store/diathesimotita_new.asp", but instead a window opens with
http://www.puttop.top/object.php?u=http://shop.oki.gr/shop/store/customerauthenticateform.asp?redirectUrl=http://shop.oki.gr/shop/store/diathesimotita_new.asp&title=Login%20Page
which of course does not work.
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {"download.default_directory": downloads_path,
"profile.default_content_settings.popups": 0,
"download.prompt_for_download": False,
"directory_upgrade": True,
"safebrowsing.enabled": True})
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()),options=options)
thestore = "http://shop.oki.gr/shop/store/diathesimotita_new.asp"
browser.get(thestore)
The initial URL opens normally from another PC.
What is happening?
I can't open either of these, but it seems like it's redirecting you to a login page first.
I have a couple of programs that do something similar, and I keep a function on hand that logs in for me; then I get the original URL again.
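A minimal sketch of that pattern follows. The log_in argument is a placeholder for your own function that fills in and submits the login form; nothing here is specific to the site in the question, and the exact URL comparison is an assumption that may need loosening if the site normalises URLs.

```python
def get_with_login(driver, url, log_in):
    """Open `url`; if the site redirects elsewhere, log in and request it again.

    `driver` is a Selenium WebDriver and `log_in` is a caller-supplied
    function that takes the driver and completes the login form.
    """
    driver.get(url)
    if driver.current_url != url:
        # We landed somewhere else (for example a login page)
        log_in(driver)
        driver.get(url)
    return driver.current_url
```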
I have already looked at this link with the same question, but I could not find an answer there:
Although my question is the same as the other question, I posted a new one with my code as well.
url='https://example.com/'
download_url="https://example.com/Download"
chromedriver = 'path\\to\\chromedriver.exe'
options = Options()
ua = UserAgent()
userAgent = ua.random
print(userAgent)
options.add_argument(f'user-agent={userAgent}')
options.add_experimental_option("prefs", {
"download.default_directory": r"C:\Users\helia\Desktop\Test",
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"safebrowsing.enabled": True
})
options.add_argument("--headless")
options.add_argument("--window-size=%s" % WINDOW_SIZE)
driver = webdriver.Chrome(chrome_options=options, executable_path=chromedriver)
driver.get(url)
user_name = driver.find_element_by_name('User')
pass_word = driver.find_element_by_name("Pass")
user_name.send_keys("my_username")
pass_word.send_keys("my_password")
driver.find_element_by_class_name("btnn.btnn-default.b").click()
driver.get(download_url)
driver.find_element_by_class_name("btn.btn-app").click()
driver.switch_to.alert.accept()
The code successfully downloads the file, but the file is 0 KB, both on the website and on my local machine; however, the file on the site has never been 0 KB before.
(The program finishes while the file is being downloaded; could that be the cause? Do I need to add some waits?)
Your question is not entirely clear, but I guess your problem is this:
After clicking the download button and accepting the alert, your code finishes immediately, so the file has not had enough time to download.
To get the file downloaded completely, you should prevent the browser from closing until the download is complete.
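One way to do that, sketched here under the assumption that the browser is Chrome or Chromium (which keeps unfinished downloads as .crdownload files), is to poll the download folder before quitting the driver. The function name and default values are illustrative only.

```python
import os
import time


def wait_for_downloads(folder, timeout=60, poll=0.5):
    """Return True once `folder` holds files and none is a partial Chrome download.

    Returns False if that does not happen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(folder)
        # Chrome names in-progress downloads "<file>.crdownload"
        if names and not any(name.endswith(".crdownload") for name in names):
            return True
        time.sleep(poll)
    return False
```

Calling wait_for_downloads(r"C:\Users\helia\Desktop\Test") after accepting the alert, and only then calling driver.quit(), would keep the browser open while the file is still being written. If the folder already contains older files, the check would need to compare against a listing taken before the click.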
So downloading files via headless Chrome with Selenium still seems to be a problem; it was asked here over a month ago with no answer, and I don't understand how they are implementing the JS from the bug thread. Is there an option I can add, or a current fix for this? The original bug page is located here.
All of my stuff is up to date as of today 10/22/17
In python:
from selenium import webdriver
options = webdriver.ChromeOptions()
prefs = {"download.default_directory": "C:/Stuff",
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True
}
options.add_experimental_option("prefs", prefs)
options.add_argument('headless')
driver = webdriver.Chrome(r'C:/Users/aaron/chromedriver.exe', chrome_options = options)
# test file to download which doesn't work
driver.get('http://ipv4.download.thinkbroadband.com/5MB.zip')
If the headless option is removed this works no problem.
The actual files I'm attempting to download are PDFs located at .aspx URLs. I'm downloading them by doing a .click() and it works great except not with the headless version. The hrefs are javascript do_postback scripts.
Why don't you locate the anchor's href and then use a GET request to download the file? This way it will work in headless mode and will be much faster. I have done that in C#.
import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
                # f.flush() commented by recommendation from J.F.Sebastian
    return local_filename
I believe now that Chromium supports this feature (as you linked to the bug ticket), it falls to the chromedriver team to add support for the feature. There is an open ticket here, but it does not appear to have a high priority at the moment. Please, everyone who needs this feature, go give it a +1!
For those of you who are not on the Chromium ticket linked above or haven't found a solution: this is working for me. Chrome is updated to v65 and chromedriver/selenium are both up to date as of 4/16/18.
prefs = {'download.prompt_for_download': False,
'download.directory_upgrade': True,
'safebrowsing.enabled': False,
'safebrowsing.disable_download_protection': True}
options.add_argument('--headless')
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome('chromedriver.exe', chrome_options=options)
driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
driver.desired_capabilities['browserName'] = 'ur mum'
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': r'C:\chickenbutt'}}
driver.execute("send_command", params)
If you're getting a "Failed - file path too long" error when downloading, make sure that the download path doesn't have a trailing space, slash, or backslash. The path must also use backslashes only. I have no idea why.
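In more recent Selenium releases the same DevTools command can be sent without patching command_executor, because Chromium-based drivers expose execute_cdp_cmd. The sketch below assumes such a driver; the function name is hypothetical, and the path clean-up simply applies the trailing-character advice above.

```python
def allow_downloads(driver, download_path):
    """Tell a Chromium-based driver to save downloads into `download_path`.

    Assumes `driver` provides execute_cdp_cmd (Selenium's Chrome and Edge
    drivers do). Trailing spaces, slashes, and backslashes are stripped.
    """
    clean_path = download_path.rstrip(" /\\")
    driver.execute_cdp_cmd(
        "Page.setDownloadBehavior",
        {"behavior": "allow", "downloadPath": clean_path},
    )
    return clean_path
```

It would be called once after creating the driver, for example allow_downloads(driver, r'C:\chickenbutt'), before triggering the download.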