I am writing a small Python script to download a file from a follow (redirect) link and retrieve its original filename and extension. However, I have come across one such link for which Python downloads the file without any extension, whereas the file has a .txt extension when it is downloaded with a browser.
Below is the code I am trying:
from urllib.request import urlopen
from urllib.parse import unquote
import wget

filePath = 'D:\\folder_path'
followLink = 'http://example.com/Reports/Download/c4feb46c-8758-4266-bec6-12358'
response = urlopen(followLink)
if response.code == 200:
    print('Follow Link(response url) :' + response.url)
    print('\n')
    unquote_url = unquote(response.url)
    file_name = wget.detect_filename(response.url).replace('|', '_')
    print('file_name - ' + file_name)
    wget.download(response.url, filePath)
The file_name variable in the code above just gives 'c4feb46c-8758-4266-bec6-12358' as the filename,
whereas I want to download the file as c4feb46c-8758-4266-bec6-12358.txt.
I have also tried to read the file name from the headers, i.e. response.info(), but I am not getting a proper file name.
Can anyone please help me with this? I am stuck in my work. Thanks in advance.
Wget gets the filename from the URL itself. For example, if your URL was https://someurl.com/filename.pdf, it is saved as filename.pdf. If it was https://someurl.com/filename, it is saved as filename. Since wget.download returns the filename of the downloaded file, you can rename it to any extension you want with os.rename(filename, filename+'.<extension>').
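A browser usually takes the name it shows from the Content-Disposition response header rather than from the URL. Below is a minimal sketch of the same idea using only the standard library. It assumes the server actually sends that header (or at least a useful Content-Type); `pick_filename` is a hypothetical helper name, not part of wget or urllib.

```python
import mimetypes
import os
from urllib.parse import unquote, urlsplit


def pick_filename(url, headers, default="download"):
    """Choose a local filename for a response.

    `headers` is the message object that urllib.request.urlopen exposes
    as `response.headers` (an email.message.Message subclass).
    """
    # 1. Prefer the filename parameter of Content-Disposition, if any.
    name = headers.get_filename()
    # 2. Otherwise fall back to the last path segment of the final URL.
    if not name:
        name = unquote(urlsplit(url).path).rsplit("/", 1)[-1]
    # Drop any directory part a server might have included.
    name = os.path.basename(name.replace("\\", "/")) or default
    # 3. If there is still no extension, guess one from Content-Type.
    if not os.path.splitext(name)[1]:
        content_type = headers.get("Content-Type", "").split(";")[0].strip()
        name += mimetypes.guess_extension(content_type) or ""
    return name
```

With the code from the question this could be called as pick_filename(response.url, response.headers). If the server sends neither header, the result is still the bare URL segment, and renaming by hand as described above remains the fallback.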
Related
I am using Python 3.8.12. I tried the following code to download files from URLs with the requests package, but got an 'Unknown file format' message when opening the zip file. I tested different zip URLs, but every downloaded zip file is 18 KB in size and none of them can be opened successfully.
import requests

file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
# save_path and file_name are not shown in the question
file_download = requests.get(file_url, allow_redirects=True, stream=True)
with open(save_path + file_name, 'wb') as f:
    f.write(file_download.content)
[Screenshot: zip file opening error message]
[Screenshot: zip file sizes]
However, once I changed the URL to file_url = 'https://www.td.gov.hk/datagovhk_tis/mttd-csv/en/table41a_eng.csv', the code worked well and the csv file was downloaded perfectly.
I have tried the requests, urllib, wget, zipfile and io packages, but none of them work.
The reason may be that the zip URL points to both the zip file and a web page, while the csv URL points to the csv file only.
I am really new to this field. Could anyone help with it? Thanks a lot!
You can send a HEAD request and examine the response headers to get information about the file. The Content-Type header reveals the actual type of what the URL returns:
import requests
file_url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
r = requests.head(file_url)
print(r.headers["Content-Type"])
gives output
text/html
So the URL you have actually points to an HTML page, not to a zip file.
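The same kind of check can be applied to the downloaded bytes before anything is written to disk. This is a rough sketch, assuming the payload is already in memory; `validate_zip_download` is a made-up helper name.

```python
import io
import zipfile


def validate_zip_download(content_type, payload):
    """Return None if `payload` looks like a zip archive, else a short reason."""
    # Content-Type may carry parameters, e.g. "text/html; charset=utf-8".
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "text/html":
        return "server returned an HTML page, not a file"
    # is_zipfile accepts a file-like object and checks the archive structure.
    if not zipfile.is_zipfile(io.BytesIO(payload)):
        return "payload is not a valid zip archive"
    return None
```

With requests this might be used as problem = validate_zip_download(r.headers.get("Content-Type"), r.content), saving the file only when problem is None.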
import wget

url = 'https://www.censtatd.gov.hk/en/EIndexbySubject.html?pcode=D5600091&scode=300&file=D5600091B2022MM11B.zip'
#url = 'https://golang.org/dl/go1.17.3.windows-amd64.zip'
wget.download(url)
I am re-framing an existing question for simplicity. I have the following code to download Excel files from a company SharePoint site.
import requests
import pandas as pd
def download_file(url):
    filename = url.split('/')[-1]
    r = requests.get(url)
    with open(filename, 'wb') as output_file:
        output_file.write(r.content)

df = pd.read_excel(r'O:\Procurement Planning\QA\VSAF_test_macro.xlsm')
df['Name'] = 'share_point_file_path_documentName' #i'm appending the sp file path to the document name
file = df['Name'] #I only need the file path column, I don't need the rest of the dataframe
# for loop for download
for url in file:
    download_file(url)
The downloads happen and I don't get any errors in Python. However, when I try to open the files, Excel reports that it cannot open the file because the file format or extension is not valid. If I print the link in a Jupyter notebook, it opens correctly, so the issue appears to be with the download.
Check r.status_code. It must be 200; otherwise you have the wrong URL or no permission.
Open the downloaded file in a text editor. It might be an HTML file (Office Online).
If the URL contains a web=1 query parameter, remove it or replace it with web=0.
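The last step can be done with the standard library instead of string replacement. This is a small sketch under the assumption that the parameter is literally named web; `without_web_param` is a hypothetical helper, and the URL in the usage note is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def without_web_param(url):
    """Return `url` with any `web` query parameter removed."""
    parts = urlsplit(url)
    # Keep every query pair except the one named "web".
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() != "web"]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, without_web_param('https://example.sharepoint.com/sites/qa/report.xlsx?d=abc&web=1') gives the same URL ending in ?d=abc. Note that urlencode re-encodes the remaining values, so unusual characters in the query string may come back in a different but equivalent encoding.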
I've downloaded some files using requests
import requests

url = 'https://www.youtube.com/watch?v=gp5tziO5lXg&feature=youtu.be'
video_name = url.split('/')[-1]
print("Downloading file:%s" % video_name)
# download the url contents in binary format
r = requests.get(url)
# open method to open a file on your system and write the contents
with open('saved.mp4', 'wb') as f:
    f.write(r.content)
and using urllib.request
import urllib.request

url = 'https://www.youtube.com/watch?v=gp5tziO5lXg&feature=youtu.be'
video_name = url.split('/')[-1]
print("Downloading file:%s" % video_name)
# Copy a network object to a local file
urllib.request.urlretrieve(url, "saved2.mp4")
When I then try to open the .mp4 file I get the following error
Cannot play
This file cannot be played. This can happen because the file type is
not supported, the file extension is incorrect or the file is
corrupted.
0xc00d36c4
If I test it with pytube it works fine.
What's wrong with the other methods?
To answer your question: the other methods do not download the video, they download the page. What you end up with is probably an HTML file with an .mp4 file extension.
That is why you get the error when you try to open the file.
If pytube works for what you need, I would suggest using it.
If you want to download videos from other platforms, you might consider youtube-dl.
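One way to confirm this is to look at the first bytes of the saved file. The sketch below is a heuristic only: it assumes that an MP4 file carries the ftyp marker at byte offset 4 (true for most MP4 files) and that an HTML page starts with a doctype or an html tag. `sniff_download` is a made-up helper name.

```python
def sniff_download(path):
    """Classify a saved file as 'html', 'mp4' or 'unknown' from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(512)
    # Most MP4 files carry an 'ftyp' box marker at byte offset 4.
    if head[4:8] == b"ftyp":
        return "mp4"
    # HTML pages normally begin with a doctype or an <html> tag.
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

Calling sniff_download('saved.mp4') on the file from the question would show whether the file is really a video or a saved web page.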
Hello, you can import IPython.display for audio display:
import IPython.display as ipd
ipd.Audio(video_name)
Regards.
I hope this solves your problem.
My first question on Stack Overflow!
I'm trying to download resumes from a job posting website. I've found the link that leads to the download, but those download links end in '.php', and hence I don't know the extension of the file that is going to be downloaded (.doc, .docx, .pdf). The relevant last section of the link looks like this: ("~/resumedownload.php?f=WFeilbBZWg==")
I log in to the website with mechanize, and this is what I do to download the file:
filename = br.retrieve(
    link.get('href'),
    os.path.expanduser("~/Desktop/Job Postings/Hirist/" + str(i) + ".pdf"))[0]
However, this only brings back the .pdf files correctly and corrupts the rest. The filename variable is a .php file.
Any suggestions?
Browser.retrieve() returns a tuple consisting of the filename to which the file was written and the headers from the remote server. You can then use the Content-Type header to determine the MIME type of the file and the mimetypes module to get an appropriate extension for the file. Finally, rename the file.
import mechanize
import shutil
import os.path
import mimetypes
#url = 'http://stackoverflow.com'
url = 'http://heriverde.nimoz.pl/wp-content/uploads/pdf-sample.pdf'
br = mechanize.Browser()
filename, headers = br.retrieve(url)
dest_dir = os.path.expanduser('~/Desktop/Job Postings/Hirist/')
# Content-Type may include encoding, e.g. text/html; charset=utf-8
content_type = headers.get('Content-Type', '').split(';')[0]
extension = mimetypes.guess_extension(content_type)
if not extension:
    extension = '.dunno'
# `i` is assumed to be a counter
dest_filename = '{}{}'.format(i, extension)
shutil.move(filename, os.path.join(dest_dir, dest_filename))
I can download a file from a URL in the following way.
import urllib2
response = urllib2.urlopen("http://www.someurl.com/file.pdf")
html = response.read()
One way I can think of is to open a file in binary mode and write this data to the different folder where I want to save it,
but is there a better way?
Thanks
You can use the Python module wget to download the file. Here is some sample code:
import wget
url = 'http://www.example.com/foo.zip'
path = 'path/to/destination'
wget.download(url,out = path)
The function you are looking for is urllib.urlretrieve
import urllib
linkToFile = "http://www.someurl.com/file.pdf"
localDestination = "/home/user/local/path/to/file.pdf"
resultFilePath, responseHeaders = urllib.urlretrieve(linkToFile, localDestination)
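The snippets in this thread use the Python 2 modules urllib2 and urllib. In Python 3 the same functionality lives in urllib.request. The following is a minimal sketch of saving a download into a chosen folder without holding the whole file in memory; `download_to_folder` is a hypothetical helper name, and taking the filename from the last URL path segment is an assumption.

```python
import os
import shutil
import urllib.request
from urllib.parse import unquote, urlsplit


def download_to_folder(url, folder):
    """Stream `url` into `folder`, naming the file after the URL's last path segment."""
    os.makedirs(folder, exist_ok=True)
    name = os.path.basename(unquote(urlsplit(url).path)) or "download"
    destination = os.path.join(folder, name)
    # copyfileobj copies in chunks, so large files are not read into memory at once.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

For the URL from the question, download_to_folder("http://www.someurl.com/file.pdf", "/home/user/local/path/to") would write /home/user/local/path/to/file.pdf, provided the server responds.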