I am downloading a file from the API URL http://api.worldbank.org/v2/en/topic/19?downloadformat=csv, and the request returns the file "API_19_DS2_en_csv_v2_10225248.zip".
Unlike a URL such as http://databank.worldbank.org/data/download/SE4ALL_csv.zip, where I can use
ntpath.basename(URL)
the URL above does not contain the file name. How can I get the file name?
The code below works:
import requests

r = requests.get(Source_Link)
URL_Metadata = r.headers['Content-Disposition']
# 9 == len('filename='); everything after that marker is the file name
Source_File_Name = URL_Metadata[URL_Metadata.find('filename=') + 9:]
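Slicing after 'filename=' works, but servers sometimes quote the value. A slightly more robust sketch (the header values shown are made-up examples):

```python
import re

def filename_from_disposition(disposition):
    """Extract the filename from a Content-Disposition header value.

    Handles both quoted and unquoted forms, e.g.
    'attachment; filename="report.zip"' and 'attachment; filename=report.zip'.
    Returns None if no filename parameter is present.
    """
    match = re.search(r'filename="?([^";]+)"?', disposition)
    return match.group(1) if match else None

print(filename_from_disposition('attachment; filename="API_19_DS2_en_csv_v2_10225248.zip"'))
# API_19_DS2_en_csv_v2_10225248.zip
print(filename_from_disposition('attachment; filename=data.csv'))
# data.csv
```

Note this simple regex does not cover the RFC 5987 `filename*=` encoded form, which some servers use for non-ASCII names.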
Here is an example URL: "https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
If you open it in the browser, a file starts downloading to your system.
I want to download this file using Python and store it somewhere on my computer.
This is what I tried:
import requests
# first_url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
second_url="https://procurement-notices.undp.org/view_file.cfm?doc_id=257280"
myfile = requests.get(second_url , allow_redirects=True)
# this works for the first URL
# open('example.pdf' , 'wb').write(myfile.content)
# this didn't work for either of them
# open('example.txt' , 'wb').write(myfile.content)
# this works for the second URL
open('example.doc' , 'wb').write(myfile.content)
First: putting first_url in the browser downloads a PDF file, while second_url downloads a .doc file. How can I know what type of file a URL will give us, so that I use the correct open(...) call?
Second: if I open the second URL in the browser, a file named "T__proc_notices_notices_080_k_notice_doc_79545_770020123.docx" starts downloading. How can I find out this file name when I download the file in code?
If you know any better solution, kindly let me know.
Kindly also have a quick look at the question Downloading Files from URLs and zip downloaded files in python.
myfile.headers['content-type'] will give you the MIME type of the URL's content, and myfile.headers['content-disposition'] gives you info such as the file name (if the response contains this header at all).
You can use the response header Content-Type: for the first URL it is application/pdf, and for the second URL it is application/msword, so you can save the file accordingly. You can build an extension dictionary that maps possible MIME types to file extensions and look each response's type up in it. Your second question is essentially the same as this one, so I am taking your two URLs from that question, and for the file names I am just using integers:
import requests

all_urls = ['https://omextemplates.content.office.net/support/templates/en-us/tf16402488.dotx',
            'https://procurement-notices.undp.org/view_file.cfm?doc_id=257280']

extension_dict = {'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx',
                  'application/vnd.openxmlformats-officedocument.wordprocessingml.template': '.dotx',
                  'application/vnd.ms-word.document.macroEnabled.12': '.docm',
                  'application/vnd.ms-word.template.macroEnabled.12': '.dotm',
                  'application/pdf': '.pdf',
                  'application/msword': '.doc'}

for i, url in enumerate(all_urls):
    resp = requests.get(url)
    # look up the file extension for the response's MIME type
    file_extension = extension_dict[resp.headers['Content-Type']]
    # the extension already includes the dot, so don't add another one
    with open(f"{i}{file_extension}", 'wb') as f:
        f.write(resp.content)
For MIME types, see this answer.
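As an alternative to a hand-maintained dictionary, the standard-library mimetypes module can map many MIME types to extensions. Note that a real Content-Type header may carry parameters such as '; charset=utf-8', which should be stripped before the lookup:

```python
import mimetypes

def extension_for(content_type):
    """Map a Content-Type header value to a file extension.

    Strips any parameters (e.g. '; charset=utf-8') before asking
    mimetypes; returns None for unknown types.
    """
    mime = content_type.split(';')[0].strip()
    return mimetypes.guess_extension(mime)

print(extension_for('application/pdf'))                 # .pdf
print(extension_for('application/zip; charset=binary')) # .zip
```

mimetypes covers far more types than a hand-rolled dictionary, though very vendor-specific types (like the macro-enabled Word formats above) may still need explicit entries.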
I am downloading some files from the FAO GAEZ database, which uses an HTTP POST based login form.
I am thus using the requests module. Here is my code:
my_user = "blabla"
my_pass = "bleble"
site_url = "http://www.gaez.iiasa.ac.at/w/ctrl?_flow=Vwr&_view=Welcome&fieldmain=main_lr_lco_cult&idPS=0&idAS=0&idFS=0"
file_url = "http://www.gaez.iiasa.ac.at/w/ctrl?_flow=VwrServ&_view=AAGrid&idR=m1ed3ed864793f16e83ba9a5a975066adaa6bf1b0"

with requests.Session() as s:
    s.get(site_url)
    # pass the credential variables, not the literal strings 'my_user'/'my_pass'
    s.post(site_url, data={'_username': my_user, '_password': my_pass})
    r = s.get(file_url)
    if r.ok:
        with open(my_path + "\\My file.zip", "wb") as c:
            c.write(r.content)
However, with this procedure I download the HTML of the page, not the zip file.
I suspect that to solve the problem I have to append the name of the zip file to the URL, i.e. new_file_url = file_url + "/file_name.zip". The problem is that I don't know the file name. I tried the name of the file I get when downloading it manually, but it did not work.
Any idea how to solve this? If you need more details on the GAEZ website, see also: Python - Login and download specific file from website
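One quick way to narrow this down is to check the Content-Type of the response before writing it to disk: if the server answers with text/html, you most likely received the login page again, which would mean the POST did not authenticate. A minimal sketch (the header names are standard; the GAEZ-specific behaviour is an assumption):

```python
def looks_like_file(headers):
    """Heuristic: True if a response looks like a binary download rather
    than an HTML page (e.g. a login form served instead of the file)."""
    content_type = headers.get('Content-Type', '').split(';')[0].strip().lower()
    return not content_type.startswith('text/html')

# A login page typically comes back as text/html ...
assert not looks_like_file({'Content-Type': 'text/html; charset=utf-8'})
# ... while the expected archive would be application/zip or similar
assert looks_like_file({'Content-Type': 'application/zip'})
```

In the code above, this check would go between `r = s.get(file_url)` and the write, so an HTML login page never ends up saved as "My file.zip".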
I'm saving an HTTP request as an HTML page.
How can I save the HTML file under the name of the URL?
I'm using Linux.
So the file name would look like this: "http://www.test.com.html"
My code:
import urllib.request

url = "http://www.test.com"
page = urllib.request.urlopen(url).read()  # returns bytes in Python 3
f = open("./file.html", "wb")
f.write(page)
f.close()
Unfortunately you cannot use the full URL as a file name: the character "/" is not allowed in file names (on Linux it is the path separator, and Windows forbids it as well).
However, you can create a file with the name www.test.com.html with the following line:
file_name = url.split('/')[2] + '.html'
If you need to handle a URL with a path, like https://www.test.com/posts/1, you can replace / with a custom separator that does not usually occur in URLs, such as __:
url = 'https://www.test.com/posts/11111111'
file_name = '__'.join(url.split('/')[2:])
This results in:
www.test.com__posts__11111111
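Putting the two cases together, the idea can be wrapped in a small helper (the '.html' suffix reflects this question's use case):

```python
def url_to_filename(url, extension='.html'):
    """Build a safe file name from a URL: drop the scheme, join the
    remaining parts with '__', and append the given extension."""
    parts = url.split('/')  # e.g. ['https:', '', 'www.test.com', 'posts', '1']
    return '__'.join(p for p in parts[2:] if p) + extension

print(url_to_filename('http://www.test.com'))
# www.test.com.html
print(url_to_filename('https://www.test.com/posts/11111111'))
# www.test.com__posts__11111111.html
```

Filtering out empty parts also handles URLs with a trailing slash cleanly.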
I am working with an API that returns a document ID, which I can then use to get the PDF. For example, document_id = 'fanlfe48ry4ihkefewfl934'. I then concatenate this ID as follows:
document_id = 'fanlfe48ry4ihkefewfl934'
full_url = url + document_id

from urllib.request import urlopen

response = urlopen(full_url)
# response.read() returns bytes, so the file must be opened in binary mode
file = open("document.pdf", 'wb')
file.write(response.read())
file.close()
But I am getting just an HTML response. The reason is that the URL is not a file with a .pdf extension; it is a URL that, when clicked in a browser, pops up the save-location dialog and then saves the file as a PDF.
I don't understand how to handle this situation.
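Whatever the URL looks like, it can help to verify that the response actually is a PDF before saving it, either via the Content-Type header or via the %PDF magic bytes at the start of the payload. A sketch of that check:

```python
def is_pdf_response(headers, first_bytes):
    """True if a response is plausibly a PDF: either the server declares
    application/pdf, or the payload starts with the PDF magic number."""
    content_type = headers.get('Content-Type', '').split(';')[0].strip()
    return content_type == 'application/pdf' or first_bytes.startswith(b'%PDF')

# Declared by the server via the header ...
assert is_pdf_response({'Content-Type': 'application/pdf'}, b'')
# ... or recognisable from the payload itself
assert is_pdf_response({}, b'%PDF-1.7 ...')
assert not is_pdf_response({'Content-Type': 'text/html'}, b'<html>')
```

If this check fails, the server sent an HTML page, and the fix belongs on the request side (cookies, headers, or a different endpoint), not in how the bytes are written.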
I am trying to download a pdf file from a website with authentication and save it locally. This code appears to run but saves a pdf file that cannot be opened ("it is either not a supported file type or because the file has been damaged").
import urllib.request

auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm=None,
                          uri=r'http://website/',
                          user='admin',
                          passwd='pass')
opener = urllib.request.build_opener(auth_handler)
urllib.request.install_opener(opener)

url = 'http://www.website.com/example.pdf'
res = opener.open(url)
urllib.request.urlretrieve(url, "example.pdf")
Sounds like you have a bad URL. Make sure you get the ".pdf" file in your browser when you enter that URL.
EDIT:
I meant to say your URL should look like this: "http://www.cse.msu.edu/~chooseun/Test2.pdf". Your code must be able to pull this PDF straight from the web address. Hope this helps.
I think the problem is with urllib.request.urlretrieve(url, "example.pdf"). After you get through the authentication, save the file using something like this instead:
pdfFile = urllib.request.urlopen(url)
file = open("example.pdf", 'wb')
file.write(pdfFile.read())
file.close()
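If the extra dependency is acceptable, the same download can be sketched with requests, whose auth=(user, password) shorthand performs HTTP Basic auth. The function below is a sketch along the lines of the urllib version above, not code from the original answer:

```python
import requests

def download_pdf(url, username, password, dest='example.pdf'):
    """Fetch a PDF behind HTTP Basic auth and write it in binary mode."""
    response = requests.get(url, auth=(username, password))
    response.raise_for_status()  # surface 401/404 instead of saving an error page
    with open(dest, 'wb') as f:
        f.write(response.content)
    return dest
```

raise_for_status() is what prevents the original symptom: without it, an authentication failure quietly writes the server's error page to "example.pdf".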