I'm trying to download and rename files (around 60 per page) using Selenium and have hit a hard bump.
Here is what I have tried:
1. Use the solution offered by supputuri, which goes through the chrome://downloads download manager. I used the code provided but ran into two issues: the opened tab does not close properly (which I can fix), and, most importantly, the helper function keeps returning None as the file name even though I can see the downloaded files in my download directory. This approach could work, but it probably needs some modification in the Chrome console command part, which I have no knowledge of.
# method to get the downloaded file name
def getDownLoadedFileName(waitTime):
    driver.execute_script("window.open()")
    # switch to new tab
    driver.switch_to.window(driver.window_handles[-1])
    # navigate to chrome downloads
    driver.get('chrome://downloads')
    # define the endTime
    endTime = time.time() + waitTime
    while True:
        try:
            # get downloaded percentage
            downloadPercentage = driver.execute_script(
                "return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value")
            # check if downloadPercentage is 100 (otherwise the script will keep waiting)
            if downloadPercentage == 100:
                # return the file name once the download is completed
                return driver.execute_script("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').text")
        except:
            pass
        time.sleep(1)
        if time.time() > endTime:
            break
Selenium give file name when downloading
2. The second approach I looked at was offered by Red in the post below. Since I'm downloading one file at a time, I figured I could always find the most recent file, change its name after the download completes, and repeat the process. The issue I hit here: once I grabbed the file object, I could not seem to find a way to get the file name; I checked the Python methods for file objects and couldn't find one that returns the name of the file.
import os
import time

def latest_download_file(num_file, path):
    os.chdir(path)
    while True:
        files = sorted(os.listdir(os.getcwd()), key=os.path.getmtime)
        # wait for the file to finish downloading
        if len(files) < num_file:
            time.sleep(1)
            print('waiting for download to be initiated')
        else:
            newest = files[-1]
            if ".crdownload" in newest:
                time.sleep(1)
                print('waiting for download to complete')
            else:
                return newest
python selenium, find out when a download has completed?
Let me know if you have any suggestions. Thanks.
The second approach worked: just download, monitor the directory, and use os.rename after the download finishes.
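For anyone who hits the same wall, a minimal sketch of that final loop, using the latest_download_file helper above (download_dir and new_name are hypothetical placeholders, not my real values):

import os

download_dir = r"C:\Users\me\Downloads"   # hypothetical download directory
new_name = "report_001.pdf"               # hypothetical target name

count_before = len(os.listdir(download_dir))
# ... click the download button with Selenium here ...
# wait until one more finished file shows up, then rename it
newest = latest_download_file(num_file=count_before + 1, path=download_dir)
os.rename(os.path.join(download_dir, newest),
          os.path.join(download_dir, new_name))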
This is the code immediately before I get stuck:
#Start a new post
driver.find_element_by_xpath("//span[normalize-space()='Start a post']").click()
time.sleep(2)
driver.find_element_by_xpath("//li-icon[#type='image-icon']//*[name()='svg']").click()
time.sleep(2)
The above code works well. The next step though has me puzzled.
I would like to upload the latest image file from my downloads folder. When I click on the link above in LinkedIn it navigates to my user folder (Melissa). (see image)
So... I'm looking for the next lines of code to navigate from the default folder to the Downloads folder (C:\Users\Melissa\Downloads), then, select the newest file, then click 'open' so it attaches to the LinkedIn post.
You can upload the image by sending its file path straight to the file input element, which bypasses the native file dialog entirely (so there is no need to navigate to the Downloads folder by hand):
import getpass, glob, os
# Replace this with the code to get the upload input tag
upload_button = driver.find_element_by_xpath("//input[@type='file']")
# Get downloads
downloads = glob.glob("C:\\Users\\{}\\Downloads\\*".format(getpass.getuser()))
# Find the latest downloaded file
latest_download = max(downloads, key=os.path.getctime)
# Enter it into the upload input tag
upload_button.send_keys(latest_download)
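One caveat worth noting: on Windows, os.path.getctime is creation time, so if a download ever overwrites an existing file its ctime may not change. A small variation (same downloads list as above) that sorts by last-modified time instead:

# pick the most recently modified file rather than the most recently created one
latest_download = max(downloads, key=os.path.getmtime)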
I am using Selenium to log in to a page and download some TIFF files.
I have a variable downloadurl that contains an array of URL links I scraped from the website. I am using the code below to download the files:
driver = webdriver.Chrome()
driver.get(downloadurl)
I do get all the files downloaded, but with no names, e.g. img(1), img(2), ...
Now my problem is: I want driver.get(downloadurl) to download the files one by one, following the sequence of the downloadurl array, and to rename each file right after it is downloaded according to the title variable (which is also an array), then download the next file, rename it, and so on.
P.S. I avoid using requests because the login procedure is very complicated and requires authorization cookies.
Many thanks for the help!
To elaborate on my comment:
import os
import time

for downloadlink, uniqueName in my_list_of_links_and_names:
    driver = webdriver.Chrome()
    driver.get(downloadlink)
    time.sleep(5)  # give it time to download (not sure if this is necessary)
    # the file is now downloaded
    os.rename("img(1).png", uniqueName)  # the name is now changed
This will work assuming that "img(1).png" will be renamed and then the next download will come in as "img(1).png" yet again.
The hardest part would be building my_list_of_links_and_names, but if you already have the data in two separate lists, just zip() them together. You could also generate your own title on each loop based on some criteria...
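For instance, with the downloadurl and title arrays from the question, a one-line sketch (variable names taken from the question):

# pair each download link with its matching name
my_list_of_links_and_names = list(zip(downloadurl, title))
# e.g. [('www.sample1.com', 'Title 1'), ('www.sample2.com', 'Title 2')]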
First, we create a function (Rename_file) that renames the downloaded image inside its download folder.
import os
import time

def Rename_file(new_name, Dl_path):  # renames downloaded files in the path
    filename = max([f for f in os.listdir(Dl_path)])
    if 'image.png' in filename:  # finds the 'image.png' name in said path
        time.sleep(2)  # you can change the value here depending on your requirements
        os.rename(os.path.join(Dl_path, filename), os.path.join(Dl_path, new_name + '.png'))  # can be changed to .jpg etc
Then we apply this function across the array of URL links, pairing each link with its title:
for link, new_name in zip(downloadurl, title):  # pair each link with its matching title
    driver.get(link)  # download the image at that link
    Rename_file(new_name, Dl_path)
Sample code:
import os
import time
from selenium import webdriver

driver = webdriver.Chrome()
downloadurl = ['www.sample1.com', 'www.sample2.com']
Dl_path = "//location//of//image_downloaded"
title = ['Title 1', 'Title 2']

def Rename_file(new_name, Dl_path):
    filename = max([f for f in os.listdir(Dl_path)])
    if 'image.png' in filename:
        time.sleep(2)
        os.rename(os.path.join(Dl_path, filename), os.path.join(Dl_path, new_name + '.png'))

for link, new_name in zip(downloadurl, title):
    driver.get(link)
    time.sleep(2)
    Rename_file(new_name, Dl_path)
I'm fairly confident in the Rename function I created, but I haven't really tested this against an array of URL links, since I can't think of anywhere to test it. Hopefully this works for you. Please let me know :-)
Backstory: I'm trying to pull some data from an FTP login I was given. The data gets updated roughly daily, and I believe they wipe the FTP at the end of each week or month. I was thinking of inputting a date and having the script run daily to see whether any files match that date, but if the server's time isn't accurate it could cause data loss. For now I just want to download ALL the files; then I'll work on fine-tuning it.
I haven't worked much with coding against FTP before, but it seems simple enough. The problem I'm having: small files download without a problem and their file sizes check out and match, but when the script tries to download a big file that would normally take a few minutes, it gets to a certain point (almost completing the file), then just stops, and the script hangs.
For Example:
It tries to download a file that is 373485927 bytes in size. The script runs and downloads that file up until 373485568 bytes. It ALWAYS stops at this amount, even after trying different methods and changing some code.
I don't understand why it always stops at this byte count, and why it works fine with smaller files (1000 bytes and under).
import os
import sys
import base64
import ftplib

def get_files(ftp, filelist):
    for f in filelist:
        try:
            print "Downloading file " + f + "\n"
            local_file = os.path.join('.', f)
            file = open(local_file, "wb")
            ftp.retrbinary('RETR ' + f, file.write)
        except ftplib.all_errors, e:
            print str(e)
        file.close()
    ftp.quit()

def list_files(ftp):
    print "Getting directory listing...\n"
    ftp.dir()
    filelist = ftp.nlst()
    #determine new files to DL, pass to get_files()
    #for now we will download all each execute
    get_files(ftp, filelist)

def get_conn(host, user, passwd):
    ftp = ftplib.FTP()
    try:
        print "\nConnecting to " + host + "...\n"
        ftp.connect(host, 21)
    except ftplib.all_errors, e:
        print str(e)
    try:
        print "Logging in...\n"
        ftp.login(user, base64.b64decode(passwd))
    except ftplib.all_errors, e:
        print str(e)
    ftp.set_pasv(True)
    list_files(ftp)

def main():
    host = "host.domain.com"
    user = "admin"
    passwd = "base64passwd"
    get_conn(host, user, passwd)

if __name__ == '__main__':
    main()
The output looks like this, with file dddd.tar.gz being the big one that never finishes:
Downloading file aaaa.del.gz
Downloading file bbbb.del.gz
Downloading file cccc.del.gz
Downloading file dddd.tar.gz
This could be caused by a timeout issue; perhaps try changing:
def get_conn(host, user, passwd):
    ftp = ftplib.FTP()
adding in a larger timeout until you have more of an idea of what's going on, like:
def get_conn(host, user, passwd):
    ftp = ftplib.FTP(timeout=100)
I'm not sure whether ftplib defaults to a timeout or not; it would be worth checking, and also worth checking whether you are being timed out by the server. Hope this helps.
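One quick way to check the server-side theory, as a sketch: turn on ftplib's protocol debugging, which prints every FTP command and response to stdout and should show whether the server closes or times out the data connection near the end of the large transfer:

def get_conn(host, user, passwd):
    ftp = ftplib.FTP(timeout=100)
    ftp.set_debuglevel(2)  # echo all FTP commands and responses to stdout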
If you are running your script in a Windows cmd console, try disabling cmd's "QuickEdit Mode" option.
I had encountered a problem where my FTP script would hang when run on Windows but worked normally on Linux. In the end I found that this solution worked for me.
I'm using a script to download images from a website, and up until recently it has been working fine. Of note, the page is HTTPS, but per the urllib docs that shouldn't be an issue. The script first requests the page and uses a regex to pull the download links from it. From there it goes into a loop to download each file, and the inner loop looks like so:
dllink = m[0].replace('\">Download','')
print dllink
#m = re.findall('[a-f0-9]+.[\w]+',dllink)
extension = re.findall('.[\w]+$',dllink)[0]
fname = post_id + extension
urllib.urlretrieve(dllink,cpath + "/" + fname)
printLine(post_id + " ")
delay = random.uniform(32.0,64.0)
dlcount = dlcount + 1
time.sleep(delay)
Again, it downloads a file, but the files I'm downloading are in the 200k-4m range, and every file has started coming back as 4k. I've copy-pasted the download links into browsers and they pull the right image, and wget also downloads them just fine, so I'm not sure what's wrong with my code that it only downloads 4k of the file. If this is a server-side issue, is there a way to call wget from Python to accomplish the same thing without urlretrieve? Thanks in advance for any help!
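For reference, the wget fallback I have in mind would be something along these lines (untested sketch, reusing dllink, cpath, and fname from the loop above, and assuming wget is on the PATH):

import subprocess

# shell out to wget; -O writes the download to an explicit output path
subprocess.call(['wget', dllink, '-O', cpath + "/" + fname])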
I am looking for a faster way to do my task. I have 40000 downloadable file URLs, and I would like to download them to my local desktop. Currently, what I am doing is placing each link in the browser and downloading it via a script. What I am looking for now is to feed 10 URLs at a time to the address bar and get the 10 files downloading simultaneously; if that is possible, hopefully the overall time will decrease.
Sorry I was late to post the code; here it is:
import os
import urllib

def _download_file(url, filename):
    """
    Given a URL and a filename, this method will save a file locally to the
    destination_directory path.
    """
    if not os.path.exists(destination_directory):
        print 'Directory [%s] does not exist, Creating directory...' % destination_directory
        os.makedirs(destination_directory)
    try:
        urllib.urlretrieve(url, os.path.join(destination_directory, filename))
        print 'Downloading File [%s]' % (filename)
    except:
        print 'Error Downloading File [%s]' % (filename)

def _download_all(main_url):
    """
    Given a URL list, this method will download each file in the destination
    directory.
    """
    url_list = _create_url_list(main_url)
    for url in url_list:
        _download_file(url, _get_file_name(url))
Thanks,
Why use a browser? This seems like an XY problem.
To download files, I'd use a library like requests (or make a system call to wget).
Something like this:
import requests

def download_file_from_url(url, file_save_path):
    r = requests.get(url)
    if r.ok:  # checks if the download succeeded
        with open(file_save_path, 'wb') as f:  # binary mode, since images aren't text
            f.write(r.content)
        return True
    else:
        return r.status_code

download_file_from_url('http://imgs.xkcd.com/comics/tech_support_cheat_sheet.png', 'new_image.png')
# will download the image and save it to the current directory as 'new_image.png'
You first have to install requests using whatever Python package manager you prefer, e.g., pip install requests. You can also get fancier; e.g., run several downloads at once, as sketched below.
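A minimal sketch of that fancier, parallel version (assuming a urls list of (url, save_path) pairs that you would build from your 40000 links; the pool size of 10 matches the 10-at-a-time goal in the question):

import requests
from multiprocessing.pool import ThreadPool

def fetch(args):
    url, file_save_path = args
    r = requests.get(url)
    if r.ok:
        with open(file_save_path, 'wb') as f:
            f.write(r.content)
        return True
    return r.status_code

# hypothetical input: list of (url, destination) pairs
urls = [('http://imgs.xkcd.com/comics/tech_support_cheat_sheet.png', 'xkcd.png')]

pool = ThreadPool(10)            # keep 10 downloads in flight at a time
results = pool.map(fetch, urls)  # blocks until every download has finished
pool.close()
pool.join()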