urllib: Get name of file from direct download link - python

I'm using Python 3 and will probably need urllib for this.
I need to know how to send a request to a direct download link and get the name of the file it attempts to save.
(As an example, a KSP mod from CurseForge: https://kerbal.curseforge.com/projects/mechjeb/files/2355387/download)
Of course, the file ID (2355387) will be changed. It could be from any project, but always on CurseForge. (If that makes a difference on the way it's downloaded.)
That example link results in the file MechJeb2-2.6.0.0.zip.
How can I return that file name in Python?
Edit: I should note that I want to avoid saving the file, reading the name, then deleting it if possible. That seems like the worst way to do this.

With urllib.request, the response object you get back from a URL exposes the final URL it was served from, after any redirects, as response.url.
>>> from urllib.request import urlopen
>>> url = 'https://kerbal.curseforge.com/projects/mechjeb/files/2355387/download'
>>> response = urlopen(url)
>>> response.url
'https://addons-origin.cursecdn.com/files/2355/387/MechJeb2-2.6.0.0.zip'
You can use os.path.basename to get the filename:
>>> from os.path import basename
>>> basename(response.url)
'MechJeb2-2.6.0.0.zip'

from urllib import request
url = 'file download link'
filename = request.urlopen(request.Request(url)).info().get_filename()
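The two approaches above can be combined: prefer the name in the Content-Disposition header, and fall back to the last path segment of the final URL. This is only a minimal sketch of that idea. The helper names pick_filename and filename_for are hypothetical, and it assumes the server either sends a Content-Disposition header or redirects to a URL that ends in the file name.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def pick_filename(headers, final_url):
    """Return the header-supplied file name, else the last URL path segment."""
    # headers is an email.message.Message, which is what response.info() returns
    name = headers.get_filename()
    if name:
        return name
    # Fallback: last path segment of the (possibly redirected) URL
    return unquote(basename(urlsplit(final_url).path))


def filename_for(url):
    """Open url and report the file name without writing anything to disk."""
    with urlopen(url) as response:
        return pick_filename(response.info(), response.url)
```

Calling filename_for(url) with the CurseForge link would then return whichever of the two names is available.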

Related

Python - Download zip files with requests package but get unknown file format

I am using Python 3.8.12. I tried the following code to download files from URLs with the requests package, but got an 'Unknown file format' message when opening the zip file. I tested several zip URLs, but every downloaded zip file is 18KB and none of them can be opened.
import requests
file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
file_download = requests.get(file_url, allow_redirects=True, stream=True)
open(save_path+file_name, 'wb').write(file_download.content)
However, once I updated the url as file_url = 'https://www.td.gov.hk/datagovhk_tis/mttd-csv/en/table41a_eng.csv' the code worked well and the csv file could be downloaded perfectly.
I tried the requests, urllib, wget, zipfile and io packages, but none of them worked.
The reason may be that the zip URL directs to both the zip file and a web page, while the csv URL directs to the csv file only.
I am really new to this field, could anyone help on it? Thanks a lot!
You can send a HEAD request and examine the response headers to learn about the file. The Content-Type header reveals the actual type of the content:
import requests
file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
r = requests.head(file_url)
print(r.headers["Content-Type"])
gives output
text/html
So the URL you have actually points to an HTML page, not a zip file.
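The same check can also be made on the downloaded bytes before they are written to disk. This is a small sketch using the standard library's zipfile.is_zipfile; the helper name is_zip_payload is hypothetical.

```python
import io
import zipfile


def is_zip_payload(data):
    """Return True if the given bytes look like a zip archive."""
    # is_zipfile accepts a file-like object, so wrap the bytes in BytesIO
    return zipfile.is_zipfile(io.BytesIO(data))
```

With the question's code, is_zip_payload(file_download.content) would be expected to return False for the HTML page, which would explain why the saved file cannot be opened.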
import wget
url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
#url = 'https://golang.org/dl/go1.17.3.windows-amd64.zip'
wget.download(url)

Python script to download PDF not downloading the PDF?

I have a Python 3.10 script to download a PDF from a URL. I get no errors, but when I run the code the PDF does not download. I've done a sanity check to ensure the PDF is actually at the URL (which it is).
I'm not sure if this maybe has something to do with HTTP/ HTTPS? This site does have an expired HTTPS certificate, but it is a government site and this is really for testing only so I am not worried about that and can ignore the error
from fileinput import filename
import os
import os.path
from datetime import datetime
import urllib.request
import requests
import urllib3
urllib3.disable_warnings()
resp = requests.get('http:// url domain .org', verify=False)
urllib.request.urlopen('http:// my url .pdf')
filename = datetime.now().strftime("%Y_%m_%d-%I_%M_%S_%p")
save_path = "C:/Users/bob/Desktop/folder"
Or maybe the issue has something to do with urllib3 ignoring the error while urllib downloads the file?
Redacted the specific URL here
The urllib.request.urlopen method doesn't save the remote URL to a file -- it returns a response object that can be treated as a file-like object. You could do something like:
response = urllib.request.urlopen('http:// my url .pdf')
# Open in binary write mode; the default mode is read-only text
with open('filename.pdf', 'wb') as fd:
    fd.write(response.read())
The urllib.request.urlretrieve method, on the other hand, will take care of writing the remote content to a local file. You would use it like this to write the PDF file to a local file named filename.pdf:
response = urllib.request.urlretrieve('http://my url .pdf',
                                      filename='filename.pdf')
See the documentation for information about the return value from the urlretrieve method.
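Because the response is file-like, it can also be streamed to disk without loading the whole PDF into memory. This is a minimal sketch, assuming the standard library's shutil.copyfileobj; the function names save_response and download are hypothetical.

```python
import shutil
from urllib.request import urlopen


def save_response(fileobj, dest_path):
    """Copy a file-like object to dest_path in binary mode."""
    with open(dest_path, "wb") as out:
        # copyfileobj reads and writes in chunks rather than all at once
        shutil.copyfileobj(fileobj, out)
    return dest_path


def download(url, dest_path):
    """Fetch url and write the response body to dest_path."""
    with urlopen(url) as response:
        return save_response(response, dest_path)
```

Splitting the copy step from the network call keeps the file-writing part easy to check on its own.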

Python download image from short url by keeping its own name

I would like to download an image file from a shortened or generated URL that doesn't contain the file name.
I have tried to use the Content-Disposition header. However, my file name is not ASCII, so it can't print the name.
I have found that I can use urlretrieve or requests to download the file, but then I need to save it under a different name.
I want to download it while keeping its own name.
How can I do this?
matches = re.match(expression, message, re.I)
url = matches[0]
print(url)
original = urlopen(url)
remotefile = urlopen(url)
#blah = remotefile.info()['content-Disposition']
#print(blah)
#value, params = cgi.parse_header(blah)
#filename = params["filename*"]
#print(filename)
#print(original.url)
#filename = "filedown.jpg"
#urlretrieve(url, filename)
These are the things I have tried, but none of them work.
I was able to get this to work with the requests library, because you can use it to get the URL that the shortened URL redirects to. Then I applied your code to the redirected URL and it worked. There might be a way to do this with urllib alone (I assume that's what you are using), but I don't know of one.
import requests
from urllib.request import urlopen
import cgi
def getFilenameFromURL(url):
    req = requests.request("GET", url)
    # req.url is now the url the shortened url redirects to
    original = urlopen(req.url)
    value, params = cgi.parse_header(original.info()['Content-Disposition'])
    filename = params["filename*"]
    print(filename)
    return filename

getFilenameFromURL("https://shorturl.at/lKOY3")
You can then use urlretrieve with this. It's inefficient, but it works. Also, since you can get the actual URL with the requests library, you can probably get the filename through it as well.
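The cgi module was deprecated in Python 3.11 and removed in 3.13, and cgi.parse_header returns the "filename*" value still percent-encoded. One possible alternative, sketched here under the assumption that the server sends an RFC 2231 style "filename*" parameter, is to let the standard library's email.message.Message decode it. The helper name filename_from_content_disposition is hypothetical.

```python
from email.message import Message


def filename_from_content_disposition(value):
    """Extract the (decoded) file name from a Content-Disposition value."""
    msg = Message()
    msg["Content-Disposition"] = value
    # get_filename() decodes percent-encoded "filename*" parameters
    # and returns None when no file name is present
    return msg.get_filename()
```

The value to pass in would be original.info()['Content-Disposition'] from the answer's code. When the response comes from urlopen, response.info().get_filename() goes through the same decoding.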

django how to download a file from the internet

I want to have a user input a file URL and then have my django app download the file from the internet.
My first instinct was to call wget inside my django app, but then I thought there may be another way to get this done. I couldn't find anything when I searched. Is there a more django way to do this?
You are not really dependent on Django for this.
I happen to like using requests library.
Here is an example:
import requests
def download(url, path, chunk_size=2048):
    req = requests.get(url, stream=True)
    if req.status_code == 200:
        # The with block closes the file, so no explicit close() is needed
        with open(path, 'wb') as f:
            # Write the body to disk chunk_size bytes at a time
            for chunk in req.iter_content(chunk_size):
                f.write(chunk)
        return path
    raise Exception('Given url returned status code: {}'.format(req.status_code))
Place this in a file and import it into your module whenever you need it.
Of course this is very minimal but this will get you started.
You can use urlopen from urllib2 (Python 2 only; in Python 3 the same function lives in urllib.request), as in this example:
import urllib2
pdf_file = urllib2.urlopen("http://www.example.com/files/some_file.pdf")
with open('test.pdf','wb') as output:
    output.write(pdf_file.read())
For more information, read the urllib2 docs.

How to test if a webpage is an image

Sorry that the title wasn't very clear. Basically, I have a list with a whole series of URLs, and I intend to download the ones that are pictures. Is there any way to check whether a webpage is an image, so that I can just skip over the ones that aren't?
Thanks in advance
You can use the requests module. Make a HEAD request and check the content type. A HEAD request does not download the response body.
import requests
response = requests.head(url)
print(response.headers.get('content-type'))
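The header value may carry parameters such as a charset, or be missing entirely, so the check itself is worth isolating. This is a small sketch with a hypothetical helper name, is_image_content_type, which treats any image/* media type as an image.

```python
def is_image_content_type(value):
    """Return True if a Content-Type header value names an image type."""
    if not value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case
    media_type = value.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")
```

It could be applied to response.headers.get('content-type') for each URL in the list. The header is only what the server claims, so it is a hint rather than proof.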
There is no reliable way. But you could find a solution that might be "good enough" in your case.
You could look at the file extension if it is present in the url e.g., .png, .jpg could indicate an image:
>>> import os
>>> name = url2filename('http://example.com/a.png?q=1')
>>> os.path.splitext(name)[1]
'.png'
>>> import mimetypes
>>> mimetypes.guess_type(name)[0]
'image/png'
where url2filename() function is defined here.
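The linked definition is not reproduced here. As an assumption about what such a helper might look like, this minimal sketch takes the last path segment of the URL and ignores the query string:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the last path segment of url, percent-decoded."""
    # urlsplit separates the path from the query string and fragment
    path = urlsplit(url).path
    return posixpath.basename(unquote(path))
```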
You could inspect Content-Type http header:
>>> import urllib.request
>>> r = urllib.request.urlopen(url) # make HTTP GET request, read headers
>>> r.headers.get_content_type()
'image/png'
>>> r.headers.get_content_maintype()
'image'
>>> r.headers.get_content_subtype()
'png'
You could check the very beginning of the http body for magic numbers indicating image files e.g., jpeg may start with b'\xff\xd8\xff\xe0' or:
>>> prefix = r.read(8)
>>> prefix # .png image
b'\x89PNG\r\n\x1a\n'
As @pafcu suggested in the answer to the related question, you could use the imghdr.what() function:
>>> import imghdr
>>> imghdr.what(None, b'\x89PNG\r\n\x1a\n')
'png'
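The imghdr module was deprecated in Python 3.11 and removed in 3.13, so on newer versions the magic-number check has to be done by hand or with a third-party package. This is a minimal sketch covering only three common formats; the function name sniff_image_type and the signature table are illustrative, not exhaustive.

```python
# (signature bytes, format name) pairs for a few common image formats
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_type(prefix):
    """Guess the image format from the first bytes of a file, else None."""
    for signature, kind in _SIGNATURES:
        if prefix.startswith(signature):
            return kind
    return None
```

Passing the result of r.read(8) from the example above is enough for these signatures.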
You can use mimetypes: https://docs.python.org/3.0/library/mimetypes.html
from urllib.request import urlopen
from mimetypes import guess_extension
url="http://example.com/image.png"
source = urlopen(url)
# Map the server's Content-Type header to a file extension
extension = guess_extension(source.info()['Content-Type'])
print(extension)
For a PNG image this will print ".png".
