urllib2 File Download Fails...due to System Security? - python

I'm trying to download files (approximately 1 - 1.5MB/file) from a NASA server (URL), but to no avail! I've tried a few things with urllib2 and run into two results:
I create a new file on my machine that is only ~200KB and has nothing in it
I create a 1.5MB file on my machine that has nothing in it!
By "nothing in it" I mean when I open the file (these are hdf5 files, so I open them in hdfView) I see no hierarchical structure...literally looks like an empty h5 file. But, when I open the file in a text editor I can see there is SOMETHING there (it's binary, so in text it looks like...well, binary).
I think I am using urllib2 appropriately, though I have never successfully used urllib2 before. Would you please comment on whether what I am doing is right or not, and suggest something better?
from urllib2 import Request, urlopen, URLError, HTTPError
base_url = 'http://avdc.gsfc.nasa.gov/index.php?site=1480884223&id=40&go=list&path=%2FH2O%2F/2010'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = base_url + file_name
req = Request(url)
# Open the url
try:
f = urlopen(req)
print "downloading " + url
# Open our local file for writing
local_file = open('test.he5', "w" + file_mode)
#Write to our local file
local_file.write(f.read())
local_file.close()
except HTTPError, e:
print "HTTP Error:",e.code , url
except URLError, e:
print "URL Error:",e.reason , url
I got this script (which seems to be the closest to working) from here.
I am unsure what the file_name should be. I looked at the page source information of the archive and pulled the file name listed there (not the same as what shows up on the web page), and doing this yields the 1.5MB file that shows nothing in hdfview.

You are creating an invalid url:
base_url = 'http://avdc.gsfc.nasa.gov/index.php?site=1480884223&id=40&go=list&path=%2FH2O%2F/2010'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = base_url + file_name
You probably meant:
base_url = 'http://avdc.gsfc.nasa.gov/'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
When downloading a large file, it's better to use a buffered copy from filehandle to filehandle:
import shutil
# ...
f = urlopen(req)
with open('test.he5', "w" + file_mode) as local_file:
shutil.copyfileobj(f, local_file)
.copyfileobj will efficiently load from the open urllib connection and write to the open local_file file handle. Note the with statement, when the code block underneath concludes it'll automatically close the file for you.

Related

python : wget module downloading file without any extension

I am writing small python code to download a file from follow link and retrieve original filename
and its extension.But I have come across one such follow link for which python downloads the file but it is without any extension whereas file has .txt extension when downloads using browser.
Below is the code I am trying :
from urllib.request import urlopen
from urllib.parse import unquote
import wget
filePath = 'D:\\folder_path'
followLink = 'http://example.com/Reports/Download/c4feb46c-8758-4266-bec6-12358'
response = urlopen(followLink)
if response.code == 200:
print('Follow Link(response url) :' + response.url)
print('\n')
unquote_url = unquote(response.url)
file_name = wget.detect_filename(response.url).replace('|', '_')
print('file_name - '+file_name)
wget.download(response.url,filePa
th)
file_name variable in above code is just giving 'c4feb46c-8758-4266-bec6-12358' as filename.
Where I want to download it as c4feb46c-8758-4266-bec6-12358.txt.
I have also tried to read file name from header i.e. response.info(). But not getting proper file name.
Anyone can please help me with this.I am stucked in my work.Thanks in advance.
Wget gets the filename from the URL itself. For example, if your URL was https://someurl.com/filename.pdf, it is saved as filename.pdf. If it was https://someurl.com/filename, it is saved as filename. Since wget.download returns the filename of the downloaded file, you can rename it to any extension you want with os.rename(filename, filename+'.<extension>').

Save streaming audio from URL as MP3, or even just audio file from URL as MP3

I am trying to have my server, in python 3, go grab files from URLs. Specifically, I would like to pass a URL into a function, I would like the function to go grab an audio file(of many varying formats) and save it as an MP3, probably using ffmpeg or ffmpy. If the URL also has a PDF, I would also like to save that, as a PDF. I haven't done much research on the PDF yet, but I have been working on the audio piece and wasn't sure if this was even possible.
I have looked at several questions here, but most notably;
How do I download a file over HTTP using Python?
It's a little old but I tried several methods in there and always get some sort of issue. I have tried using the requests library, urllib, streamripper, and maybe one other.
Is there a way to do this and with a recommended library?
For example, most of the ones I have tried do save something, like the html page, or an empty file called 'file.mp3' in this case.
Streamripper received a try changing user agents error.
I am not sure if this is possible, but I am sure there is something I'm not understanding here, could someone point me in the right direction?
This isn't necessarily the code I'm trying to use, just an example of something I have used that doesn't work.
import requests
url = "http://someurl.com/webcast/something"
r = requests.get(url)
with open('file.mp3', 'wb') as f:
f.write(r.content)
# Retrieve HTTP meta-data
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
**Edit
import requests
import ffmpy
import datetime
import os
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE AUDIO/MPEG, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.MP3
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE application/pdf, THE FILE WILL
## BE SAVED AS THE CURRENT-DATE-AND-TIME.PDF
##
## THIS SCRIPT CAN BE PASSED A URL AND IF THE URL RETURNS
## HTTP HEADER FOR CONTENT TYPE other than application/pdf, OR
## audio/mpeg, THE FILE WILL NOT BE SAVED
def BordersPythonDownloader(url):
print('Beginning file download requests')
r = requests.get(url, stream=True)
contype = r.headers['content-type']
if contype == "audio/mpeg":
print("audio file")
filename = '[{}].mp3'.format(str(datetime.datetime.now()))
with open('file.mp3', 'wb+') as f:
f.write(r.content)
ff = ffmpy.FFmpeg(
inputs={'file.mp3': None},
outputs={filename: None}
)
ff.run()
if os.path.exists('file.mp3'):
os.remove('file.mp3')
elif contype == "application/pdf":
print("pdf file")
filename = '[{}].pdf'.format(str(datetime.datetime.now()))
with open(filename, 'wb+') as f:
f.write(r.content)
else:
print("URL DID NOT RETURN AN AUDIO OR PDF FILE, IT RETURNED {}".format(contype))
# INSERT YOUR URL FOR TESTING
# OR CALL THIS SCRIPT FROM ELSEWHERE, PASSING IT THE URL
#DEFINE YOUR URL
#url = 'http://archive.org/download/testmp3testfile/mpthreetest.mp3'
#CALL THE SCRIPT; PASSING IT YOUR URL
#x = BordersPythonDownloader(url)
#ANOTHER EXAMPLE WITH A PDF
#url = 'https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/12-2SY/configuration/guide/sy_swcg/etherchannel.pdf'
#x = BordersPythonDownloader(url)
Thanks Richard, this code works and helps me understand this better. Any suggestions for improving the above working example?

How can I input a filename and download the file in Python?

I have data base of file. I'm writing a program to ask the user to input file name and using that input to find the file, download it,make a folder locally and save the file..which module in Python should be used?
Can be as small as this:
import requests
my_filename = input('Please enter a filename:')
my_url = 'http://www.somedomain/'
r = requests.get(my_url + my_filename, allow_redirects=True)
with open(my_filename, 'wb') as fh:
fh.write(r.content)
Well, do you have the database online?
If so I would suggest you the requests module, very pythonic and fast.
Another great module based on requests is robobrowser.
Eventually, you may need beautiful soup to parse the HTML or XML data.
I would avoid using selenium because it's designed for web-testing, it needs a browser and its webdriver and it's pretty slow. It doesn't fit your needs at all.
Finally, to interact with the database I'd use sqlite3
Here a sample:
from requests import Session
import os
filename = input()
with Session() as session:
url = f'http://www.domain.example/{filename}'
try:
response = session.get(url)
except requests.exceptions.ConnectionError:
print('File not existing')
download_path = f'C:\\Users\\{os.getlogin()}\\Downloads\\your application'
os.makedirs(dowload_path, exist_ok=True)
with open(os.path.join(download_path, filename), mode='wb') as dbfile:
dbfile.write(response.content)
However, you should read how to ask a good question.

Downloading pdf with urllib.requests writes a pdf that cannot be opened

I am trying to download a pdf file from a website with authentication and save it locally. This code appears to run but saves a pdf file that cannot be opened ("it is either not a supported file type or because the file has been damaged").
import urllib.request
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm=None,
uri=r'http://website/',
user='admin',
passwd='pass')
opener = urllib.request.build_opener(auth_handler)
urllib.request.install_opener(opener)
url = 'http://www.website.com/example.pdf'
res = opener.open(url)
urllib.request.urlretrieve(url, "example.pdf")
Sounds like you have a bad URL. Make sure you get the ".pdf" file on your browser when you enter that URL in the browser.
EDIT:
I meant to say, your URL should be like this: "http://www.cse.msu.edu/~chooseun/Test2.pdf" Your code must be able to pull off this pdf from the web address. Hope this helps.
I think the problem is with "urllib.request.urlretrieve(url, "example.pdf")". After you get through the authentication save the file using something like this instead:
pdfFile = urllib.request.urlopen(url)
file = open("example.pdf", 'wb')
file.write(pdfFile.read())
file.close()

can python be useful to open multiple tabs in a browser in one shot?

I am looking for a faster way to do my task. i have 40000 file downloadable urls. I would like to download them in local desktop is.now the thought is currently what I am doing is placing the link on the browser and then download them via a script.now what I am looking for is to place 10 urls in a chunk to the address bar and get the 10 files to be downloaded at the same time.If it possible hope overall time will be decreased.
Sorry I was late to give the code,here it is :
def _download_file(url, filename):
"""
Given a URL and a filename, this method will save a file locally to theĀ»
destination_directory path.
"""
if not os.path.exists(destination_directory):
print 'Directory [%s] does not exist, Creating directory...' % destination_directory
os.makedirs(destination_directory)
try:
urllib.urlretrieve(url, os.path.join(destination_directory, filename))
print 'Downloading File [%s]' % (filename)
except:
print 'Error Downloading File [%s]' % (filename)
def _download_all(main_url):
"""
Given a URL list, this method will download each file in the destination
directory.
"""
url_list = _create_url_list(main_url)
for url in url_list:
_download_file(url, _get_file_name(url))
Thanks,
Why use a browser? This seems like an XY problem.
To download files, I'd use a library like requests (or make a system call to wget).
Something like this:
import requests
def download_file_from_url(url, file_save_path):
r = requests.get(url)
if r.ok: # checks if the download succeeded
with file(file_save_path, 'w') as f:
f.write(r.content)
return True
else:
return r.status_code
download_file_from_url('http://imgs.xkcd.com/comics/tech_support_cheat_sheet.png', 'new_image.png')
# will download image and save to current directory as 'new_image.png'
You first have to install requests using whatever python package manager you prefer e.g., pip install requests. You can also get fancier; e.g.,

Categories

Resources