I am attempting to download a file from the internet using Python along with the sys and urllib2 modules. The idea is for the user to input the version of the file they want to download, 1_4 for example. The program then appends the user input and "/whateverfile.jar" to the URL and downloads the file. My problem is that when the program inserts "/whateverfile.jar", it ends up on a new line instead of on the same line, which causes the program to fail to download the .jar properly.
Can anyone help me with this? The code and output are below.
Code:
import sys
import urllib2
print('Type version of file you wish to download.')
print('To download 1.4 for instance type "1_4" using underscores in place of the periods.')
W = ('http://assets.file.net/')
X = sys.stdin.readline()
Y = ('/file.jar')
Z = X+Y
V = W+X
U = V+Y
T = U.lstrip()
print(T)
def JarDownload():
    url = "T"
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,
    f.close()
Output:
Type version of file you wish to download.
To download 1.4 for instance type "1_4" using underscores in place of the periods.
1_4
http://assets.file.net/1_4
/file.jar
I am currently not calling the JarDownload() function at all until the URL displays as a single line when printed to the screen.
When you type the input and hit Return, the sys.stdin.readline() call will append the newline character to the string and return it. To get the desired effect, you should strip the newline from the input before using it. This should work:
X = sys.stdin.readline().rstrip()
As a side note, you should probably give more meaningful names to your variables. Names like X, Y, Z, etc. say nothing about the variables' contents and make even simple operations, like your concatenations, unnecessarily hard to understand.
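For example, a sketch of the same concatenation with descriptive names (the base URL and jar name are taken from your question; the variable names are just suggestions):
import sys

base_url = 'http://assets.file.net/'
jar_name = '/file.jar'

# readline() includes the trailing newline, so strip it before concatenating
version = sys.stdin.readline().rstrip()

jar_url = base_url + version + jar_name
print(jar_url)  # e.g. http://assets.file.net/1_4/file.jar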
I found this code here, which monitors the progress of downloads:
import urllib2
url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()
I do not see the block size being modified at any point in the while loop, so, to me, buffer = u.read(block_sz) should keep reading the same thing over and over again.
We know this doesn't happen. Is it because read() in the while loop has a built-in pointer that starts reading from where it left off last time?
What about write()? Does it keep appending after where it last left off, even though the file is not opened in append mode?
File objects and network sockets and other forms of I/O communication are data streams. Reading from them always produces the next section of data, calling .read() with a buffer size does not re-start the stream from the beginning.
So yes, there is a virtual 'position' in streams where you are reading from.
For files, that position is called the file pointer, and this pointer advances both for reading and writing. You can alter the position by seeking, or by simply re-opening the file. You can ask a file object to tell you the current position, too.
For network sockets however, you can only go forward; the data is received from outside your computer and reading consumes that data.
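A quick illustration of the file pointer (a minimal sketch; 'example.txt' is just a throwaway file name for the example):
f = open('example.txt', 'w+b')
f.write('abcdef')
print f.tell()   # 6 -- writing advanced the file pointer
f.seek(0)        # move the pointer back to the start
print f.read(3)  # 'abc'
print f.read(3)  # 'def' -- the next read continues where the previous one stopped
f.close()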
I'm trying to download a file from the internet using Python.
I've tried this code:
import urllib.request
URL = 'http://www.mediafire.com/download/raju14e8aq6azbo/Getting+Started+with+MediaFire.pdf'
filename = "file.pdf"
urllib.request.urlretrieve(URL,filename)
and:
from urllib.request import urlopen
from shutil import copyfileobj
URL = 'http://www.mediafire.com/download/raju14e8aq6azbo/Getting+Started+with+MediaFire.pdf'
filename = "file.pdf"
with urlopen(URL) as in_stream, open(filename, 'wb') as out_file:
    copyfileobj(in_stream, out_file)
(I found this last code at: What command to use instead of urllib.request.urlretrieve?)
The problem is that this code downloads an HTML document and not the .pdf file named "Getting Started with MediaFire.pdf" that I need!
I'm looking for a way to download the file that is served behind the HTML page.
Any suggestions?
That's because the link you're trying to download is not a PDF file; it's an HTML document. You can open it with Chrome/Firefox/other browsers.
You need to find the correct link to download. Try using "save as" in the browser; if that works, then the Python code will work.
Just because a URL ends with ".pdf" doesn't mean it is really a PDF.
For your example the correct link is - http://download834.mediafire.com/dsq8ih5dubng/raju14e8aq6azbo/Getting+Started+with+MediaFire.pdf which actually downloads the file if you use ctrl+s or wget or curl.
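One way to spot this kind of problem from code is to check the Content-Type header before saving anything. A minimal sketch, using urllib.request as in your snippets and the URL from your question:
from urllib.request import urlopen

URL = 'http://www.mediafire.com/download/raju14e8aq6azbo/Getting+Started+with+MediaFire.pdf'

# inspect the response headers before deciding to save the body
with urlopen(URL) as response:
    content_type = response.headers.get('Content-Type', '')
    if 'text/html' in content_type:
        print('This URL returns an HTML page, not the file itself.')
    else:
        print('Looks like a direct download:', content_type)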
Sorry, sometimes I am the most ninny person in the world!
JDK was right, I used the wrong URL every time; even when JDK told me to change the URL, I changed it to another wrong URL!
So
I marked JDK's answer as the correct one, and below I posted the code that I finally used:
import urllib2,fpformat
url = "http://download1063.mediafire.com/qjhujh1ajzwg/raju14e8aq6azbo/Getting+Started+with+MediaFire.pdf"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
print ""
file_size_dl = 0
block_sz = int(fpformat.fix(file_size/110,0))
print block_sz
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = ( file_size_dl * 100 ) / file_size
    print status, ' % - ', file_size_dl, ' bytes of ', file_size, ' bytes'
f.close()
print " complete ! "
It isn't the most useful code; I'm working on code that is faster and more correct, and I will post it below as soon as I finish it!
Description of code
My script below works fine. It basically just finds all the data files that I'm interested in from a given website, checks to see if they are already on my computer (and skips them if they are), and lastly downloads them to my computer using cURL.
The problem
The problem I'm having is that sometimes there are 400+ very large files and I can't download them all at the same time. I'll press Ctrl-C, but it seems to cancel the cURL download rather than the script, so I end up needing to cancel all the downloads one by one. Is there a way around this? Maybe somehow making a key command that will let me stop at the end of the current download?
#!/usr/bin/python
import os
import urllib2
import re
import timeit
filenames = []
savedir = "/Users/someguy/Documents/Research/VLF_Hissler/Data/"
#connect to a URL
website = urllib2.urlopen("http://somewebsite")
#read html code
html = website.read()
#use re.findall to get all the data files
filenames = re.findall('SP.*?\.mat', html)
#The following chunk of code checks to see if the files are already downloaded and deletes them from the download queue if they are.
count = 0
countpass = 0
for files in os.listdir(savedir):
    if files.endswith(".mat"):
        try:
            filenames.remove(files)
            count += 1
        except ValueError:
            countpass += 1
print "counted number of removes", count
print "counted number of failed removes", countpass
print "number files less removed:", len(filenames)
#saves the file names into an array of html link
links = len(filenames)*[0]
for j in range(len(filenames)):
    links[j] = 'http://somewebsite.edu/public_web_junk/southpole/2014/'+filenames[j]
for i in range(len(links)):
    os.system("curl -o "+ filenames[i] + " " + str(links[i]))
print "links downloaded:",len(links)
You could always check the file size using curl before downloading it:
import subprocess, sys

def get_file_size(url):
    """
    Gets the file size of a URL using curl.

    #param url: The URL to obtain information about.
    #return: The file size, as an integer, in bytes.
    """
    # Get the file size in bytes
    p = subprocess.Popen(('curl', '-sI', url), stdout=subprocess.PIPE)
    for s in p.stdout.readlines():
        if 'Content-Length' in s:
            file_size = int(s.strip().split()[-1])
    return file_size

# Your configuration parameters
url = ...       # URL that you want to download
max_size = ...  # Max file size in bytes

# Now you can do a simple check to see if the file size is too big
if get_file_size(url) > max_size:
    sys.exit()

# Or you could do something more advanced
bytes = get_file_size(url)
if bytes > max_size:
    s = raw_input('File is {0} bytes. Do you wish to download? '
                  '(yes, no) '.format(bytes))
    if s.lower() == 'yes':
        # Add download code here....
        pass
    else:
        sys.exit()
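As a usage sketch, the check can be dropped into the download loop from your script so that oversized files are skipped; switching from os.system to subprocess.call also generally lets the parent script receive Ctrl-C, so one interrupt stops the whole loop. The 50 MB limit is purely an assumed example, and filenames/links are the lists built in your script:
import subprocess

max_size = 50 * 1024 * 1024  # assumed example limit of 50 MB; adjust as needed

try:
    for name, link in zip(filenames, links):
        if get_file_size(link) > max_size:
            print "skipping (too big):", name
            continue
        # unlike os.system, subprocess.call lets the parent script see Ctrl-C
        ret = subprocess.call(['curl', '-o', name, link])
        if ret != 0:
            print "curl returned", ret, "for", link
except KeyboardInterrupt:
    # pressing Ctrl-C now stops the whole loop instead of just the current download
    print "download loop stopped by user"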
Using the zipfile module I have created a script to extract my archived files, but the method is corrupting everything other than txt files.
def unzip(zip):
    filelist = []
    dumpfold = r'M:\SVN_EReportingZones\eReportingZones\data\input\26012012'
    storage = r'M:\SVN_EReportingZones\eReportingZones\data\input\26012012__download_dump'
    file = storage + '\\' + zip
    unpack = dumpfold + '\\' + str(zip)
    print file
    try:
        time.sleep(1)
        country = str(zip[:2])
        countrydir = dumpfold + '\\' + country
        folderthere = 0
        if exists(countrydir):
            folderthere = 1
        if folderthere == 0:
            os.makedirs(countrydir)
        zfile = zipfile.ZipFile(file, 'r')
        ## print zf.namelist()
        time.sleep(1)
        shapepresent = 0
Here I have a problem: by reading and writing the zipped data, the zipfile calls seem to be rendering it unusable by the programs in question. I am trying to unzip shapefiles for use in ArcGIS...
        for info in zfile.infolist():
            fname = info.filename
            data = zfile.read(fname)
            zfilename = countrydir + '\\' + fname
            fout = open(zfilename, 'w')  # reads and copies the data
            fout.write(data)
            fout.close()
            print 'New file created ----> %s' % zfilename
    except:
        traceback.print_exc()
        time.sleep(5)
Would it be possible to call WinRar using a system command and get it to do my unpacking for me? Cheers, Alex
EDIT
Having used the wb method, it works for most of my files, but some are still being corrupted. When I used WinRAR to manually unzip the problematic files, they load properly, and they also show a larger file size.
Please could somebody point me in the direction of calling WinRAR for the complete unzip process?
You are opening the file in text mode. Try:
fout = open(zfilename, 'wb')# reads and copies the data
The b opens the file in binary mode, where the runtime libraries don't try to do any newline conversion.
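Alternatively, zipfile can write the files out for you (Python 2.6+), which sidesteps the manual read/write loop entirely. A minimal sketch, reusing the zfile and countrydir names from your function:
import os

for info in zfile.infolist():
    zfile.extract(info, countrydir)  # extract() always writes the member in binary
    print 'New file created ----> %s' % os.path.join(countrydir, info.filename)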
To answer the second section of your question, I suggest the envoy library. To use WinRAR with envoy:
import envoy
r = envoy.run('unrar e {0}'.format(zfilename))
if r.status_code > 0:
    print r.std_err
print r.std_out
To do it without envoy:
import subprocess
r = subprocess.call('unrar e {0}'.format(zfilename), shell=True)
print "Return code for {0}: {1}".format(zfilename, r)
I am attempting to look at how an HTML5 app works, and any attempt to save the page inside the WebKit browsers (Chrome, Safari) includes some, but not all, of the cache.manifest resources. Is there a library or set of code that will parse the cache.manifest file and download all the resources (images, scripts, CSS)?
(original code moved to answer... noob mistake >.<)
I originally posted this as part of the question... (no newbie Stack Overflow poster EVER does this ;)
Since there was a resounding lack of answers, here you go:
I was able to come up with the following Python script to do so, but any input would be appreciated =) (This is my first stab at Python code, so there might be a better way.)
import os
import urllib2
import urllib
cmServerURL = 'http://<serverURL>:<port>/<path-to-cache.manifest>'
# download file code taken from stackoverflow
# http://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python
def loadURL(url, dirToSave):
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(dirToSave, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,
    f.close()
# download the cache.manifest file
# since this request doesn't include the Content-Length header we will use a different api =P
urllib.urlretrieve (cmServerURL+ 'cache.manifest', './cache.manifest')
# open the cache.manifest and go through line-by-line checking for the existance of files
f = open('cache.manifest', 'r')
for line in f:
    filepath = line.split('/')
    if len(filepath) > 1:
        fileName = line.strip()
        # if the file doesn't exist, let's download it
        if not os.path.exists(fileName):
            print 'NOT FOUND: ' + line
            dirName = os.path.dirname(fileName)
            print 'checking directory: ' + dirName
            if not os.path.exists(dirName):
                os.makedirs(dirName)
            else:
                print 'directory exists'
            print 'downloading file: ' + cmServerURL + line,
            loadURL(cmServerURL + fileName, fileName)
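One possible refinement (a sketch of an assumption, not part of the original script): cache.manifest files can also contain comment lines starting with '#' and section headers such as CACHE:, NETWORK: and FALLBACK:, so a stricter filter avoids treating those as resource paths:
# stricter line filter for the manifest; reuses loadURL and cmServerURL from above
for line in open('cache.manifest', 'r'):
    entry = line.strip()
    if not entry or entry.startswith('#'):
        continue  # skip blanks and comments
    if entry in ('CACHE MANIFEST', 'CACHE:', 'NETWORK:', 'FALLBACK:'):
        continue  # skip section headers
    if '/' in entry and not os.path.exists(entry):
        loadURL(cmServerURL + entry, entry)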