How to download a file from a link like this? - python

I want to ask how to download a file using Python from a link like this. I crawled through Stack Overflow for a while and didn't find anything that works.
I got a link to a file, something like this:
https://w3.google.com/tools/cio/forms/anon/org/contentload?content=https://w3.ibm.com/tools/cio/forms/secure/org/data/f48f2294-495b-48f5-8d4e-e418f4b25a48/F_Form1/attachment/bba4ddfd-837d-47a6-87ef-2114f6b3da08 (link doesn't work, just showing you how it should look)
Clicking it opens the browser and the file starts downloading.
I don't know what the file will be named or what format it will have; all I have is a URL like the one above that links to the file.
I tried this:
import requests

def Download(link):
    r = requests.get(link)
    with open('filename.docx', 'wb') as f:
        f.write(r.content)
But this definitely doesn't work. As you can see, I put the file name in manually out of desperation, but that doesn't help either: it creates a file, but it is only 1 KB and has nothing in it.
I don't know how to write this so that it automatically downloads files from links like this. Can you help?

Use urlretrieve from urllib; see the next answer for an example.

You can use urllib.request.urlretrieve to get the contents of the file.
Example:
import urllib.request

# urlretrieve downloads straight to the given path; there is no need to
# open the file and write to it yourself.
urllib.request.urlretrieve(
    "https://w3.google.com/tools/cio/forms/anon/org/contentload?content=https://w3.ibm.com/tools/cio/forms/secure/org/data/f48f2294-495b-48f5-8d4e-e418f4b25a48/F_Form1/attachment/bba4ddfd-837d-47a6-87ef-2114f6b3da08",
    "filename.docx")
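If the server advertises the file name, you can also recover it at runtime instead of hard-coding filename.docx. A minimal sketch with requests (the download function and the fallback name are illustrative, and the server is assumed to send a Content-Disposition header):

import re
import requests

def download(link, fallback="download.bin"):
    r = requests.get(link)
    r.raise_for_status()
    # Try to pull the real file name out of the Content-Disposition header.
    cd = r.headers.get("Content-Disposition", "")
    match = re.search(r'filename="?([^";]+)"?', cd)
    name = match.group(1) if match else fallback
    with open(name, "wb") as f:
        f.write(r.content)
    return name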

Related

Downloading files from urls listed in txt file without using wget

Since I am not able to install the wget library at my work, I need a workaround for downloading files using URLs listed in a txt file. I have a txt file called urls.txt which contains about a thousand links, each pointing to a file that needs to be downloaded. So far I have something like this, but unfortunately it isn't downloading any files, although the script runs.
import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    urllib.request.urlretrieve(link)
The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name)
From the docs.
You'll need to specify a second argument giving the file path to which the file's contents should be downloaded, like so:
...
for link in linkList:
    link = link.strip()  # readlines() keeps the trailing newline
    urllib.request.urlretrieve(link, link.split('/')[-1])
As it stands, you're downloading into a temp file with a generated name (urlretrieve returns that name as the first element of its result tuple), but it's simplest to specify the file path yourself.

accessing source code from a local file python

I need to access the source code of a locally saved file, and I need to automate this because there are multiple files in one folder. I've looked at the inspect module and the selenium module, but I still don't understand what to do. After accessing the source code, I need to use bs4 to extract data from it.
I've read several posts here and elsewhere with similar problems, but the thing is that my file does not open as source code (it's written in XML, and everything needs to be in source form before you can use these modules). If I open the file, my browser just renders it as a regular page and I then have to click "view page source".
How can I automate this so that it will open the page, go to the source code, and save it so I can stick it into a soup for later parsing?
path_g_jurt = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt'
file = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt' + "/" + file

for file in path_g_jurt:
    if file.endswith(".xhtml"):
        with open(file, encoding="utf-8") as mdata_jurt:
            soup = BeautifulSoup(mdata_jurt)
        main = file.find("jcid").get_text()
        misc_links = []
        for item in file.find_all("regelgeving"):
            misc = item.find("misc:link")
            misc_links.append(misc.get("misc:jcid"))
Any help would be appreciated.
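For anyone with the same problem: there is no need to open a browser at all, since the "source code" is just the file's contents on disk. A minimal sketch of the loop, assuming the tag names from the question and that bs4 plus the lxml XML parser are installed (the variable names are illustrative):

import os
from bs4 import BeautifulSoup

folder = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt'

for name in os.listdir(folder):  # iterate over file names, not over the path string
    if name.endswith(".xhtml"):
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            soup = BeautifulSoup(f, "xml")  # parse as XML so the raw markup is used
        main = soup.find("jcid").get_text()
        misc_links = []
        for item in soup.find_all("regelgeving"):
            misc = item.find("misc:link")
            if misc is not None:
                misc_links.append(misc.get("misc:jcid"))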

Python 3.3 Code to Download a file to a location and save as a given file name

For example, I would like to save the .pdf file at http://arxiv.org/pdf/1506.07825 with the filename 'Data Assimilation- A Mathematical Introduction' at the location 'D://arXiv'.
But I have many such files. So, my input is of the form of a .csv file with rows given by (semi-colon is the delimiter):
url; file name; location.
I found some code here: https://github.com/ravisvi/IDM
But that is a bit advanced for me to parse. I want to start with something simpler. The above seems to have more functionality than I need right now - threading, pausing etc.
So can you please write me a very minimal code to do the above:
save the file 'Data Assimilation- A Mathematical Introduction'
from 'http://arxiv.org/pdf/1506.07825'
at 'D://arXiv'?
I think I will be able to generalize it to deal with a .csv file.
Or give me a hint on where to get started. (The GitHub repository already has a solution, and it is too perfect! I want something simpler.) My guess is that with Python, a task like this should be possible in no more than 10 lines of code. So tell me the important ingredients of the code, and perhaps I can figure it out.
Thanks!
I would use the requests module, you can just pip install requests.
Then, the code is simple:
import requests

response = requests.get(url)
if response.ok:
    file = open(file_path, "wb+")  # write, binary, allow creation
    file.write(response.content)
    file.close()
else:
    print("Failed to get the file")
Using Python 3.6.5, here is a method that creates a folder (if needed) and saves the file into it.
data_url - complete URL path
data_path - where the file needs to be saved
tgz_path - name of the data file, with the extension
import os
import tarfile
import urllib.request

def fetch_data_from_tar(data_url, data_path, tgz_path):
    if not os.path.isdir(data_path):
        os.mkdir(data_path)
        print("Data folder created at path", data_path)
    else:
        print("Folder path already exists")
    tgz_path = os.path.join(data_path, tgz_path)
    urllib.request.urlretrieve(data_url, filename=tgz_path)
    data_tgz = tarfile.open(tgz_path)
    data_tgz.extractall(path=data_path)
    data_tgz.close()
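A call might look like this (the URL and paths are illustrative):

fetch_data_from_tar("http://example.com/data.tgz", "D:/arXiv", "data.tgz")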

Opening a text file from the private folder in web2py

I need to open a database (in .txt format) for my search engine script in web2py.
I can not access the online database, because I use the free version of pythonanywhere.
import urllib

infile = urllib.urlopen('http://database.net')
for line in infile:
    ...
Now I've uploaded the database to the "private" folder and I wonder how to access it. It looks like a simple question, but I can't seem to work it out.
I need something like this:
infile = open('searchapp/private/database.txt')
for line in infile:
    ...
What is a good solution?
This should do:
import os

# request.folder is the application's root directory in web2py
infile = open(os.path.join(request.folder, 'private', 'database.txt'))
for line in infile:
    ...
http://www.web2py.com/books/default/chapter/29/04/the-core#request
http://docs.python.org/2/library/os.path.html#os.path.join

Naming a file when downloading with Selenium Webdriver

I see that you can set where to download a file to through Webdriver, as follows:
from os import getcwd
from selenium import webdriver

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/csv")
browser = webdriver.Firefox(firefox_profile=fp)
But I was wondering if there is a similar way to give the file a name when it is downloaded? Preferably not something associated with the profile, as I will be downloading ~6000 files through one browser instance and do not want to have to re-initiate the driver for each download.
I would suggest a slightly unusual approach: do not download files through Selenium if you can avoid it.
I mean: get the file's URL and use the urllib library to download the file and save it to disk "manually". The issue is that Selenium doesn't have a tool to handle OS dialogs such as the "Save as" dialog. I'm not sure, but I doubt it can handle any OS dialogs at all; please correct me if I'm wrong. :)
Here's a tiny example:
import urllib  # Python 2; in Python 3 use urllib.request.urlretrieve

urllib.urlretrieve("http://www.yourhost.com/yourfile.ext", "your-file-name.ext")
The only job for us here is to make sure that we handle all the urllib Exceptions. Please see http://docs.python.org/2/library/urllib.html#urllib.urlretrieve for more info.
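Sketching that idea with the browser object from the question: pull the link's URL out of the page with Selenium, copy the session cookies over, and let requests save the file under any name you like. The CSS selector and file name are illustrative, and copying cookies is an assumption that only matters when the download sits behind a login:

import requests
from selenium.webdriver.common.by import By

# Find the download link in the page and read its target URL.
link = browser.find_element(By.CSS_SELECTOR, "a.download").get_attribute("href")

# Reuse the browser's cookies so an authenticated download still works.
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

with open("your-file-name.ext", "wb") as f:
    f.write(session.get(link).content)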
I do not know if there is a pure Selenium handler for this, but here is what I have done when I needed to do something with the downloaded file.
Set up a loop that polls your download directory for the latest file that does not have a .part extension (a .part extension indicates a partial download and would occasionally trip things up if not accounted for). Put a timer on the loop to ensure that you don't spin forever if a timeout or other error keeps the download from completing. I used the output of the ls -t <dirname> command on Linux (my old code uses the commands module, which is deprecated, so I won't show it here :) ) and got the first file by using:
# result = output of ls -t
result = result.split('\n')[1].split(' ')[-1]
If the while loop exits successfully, the topmost file in the directory will be your file, which you can then modify using os.rename (or anything else you like).
Probably not the answer you were looking for, but hopefully it points you in the right direction.
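For reference, a portable version of that polling loop using os.path.getmtime instead of parsing ls -t output (the function name and the timeout value are illustrative):

import os
import time

def wait_for_download(directory, timeout=60):
    """Poll `directory` until the newest file has no .part extension."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        paths = [os.path.join(directory, f) for f in os.listdir(directory)]
        paths = [p for p in paths if os.path.isfile(p)]
        if paths:
            newest = max(paths, key=os.path.getmtime)
            if not newest.endswith(".part"):
                return newest  # the finished download
        time.sleep(1)
    raise TimeoutError("download did not complete in time")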
A solution with code, as suggested by the selected answer: rename the file after each one is downloaded.
import os

os.chdir(SAVE_TO_DIRECTORY)  # SAVE_TO_DIRECTORY: wherever the browser saves downloads
files = filter(os.path.isfile, os.listdir(SAVE_TO_DIRECTORY))
files = [os.path.join(SAVE_TO_DIRECTORY, f) for f in files]  # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))
newest_file = files[-1]
os.rename(newest_file, docName + ".pdf")  # docName: the name you want the file to have
This answer was posted as an edit to the question naming a file when downloading with Selenium Webdriver by the OP user1253952 under CC BY-SA 3.0.
