How to handle downloading files in Python?

I have an array that contains URL addresses of remote files.
By default I tried to download all the files with this naive approach:
for a in ARRAY:
    wget.download(url=a, out=path_folder)
It fails for various reasons: the host server returns a timeout, some URLs are broken, and so on.
How can I handle this process more professionally? I could not apply the other approaches I found to my case.

If you still want to use wget, you can wrap the download in a try..except block that just prints any exception and moves on to the next file:
for f in files:
    try:
        wget.download(url=f, out=path_folder)
    except Exception as e:
        print("Could not download file {}".format(f))
        print(e)
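A small variation on the same idea, in case you want to retry later: collect the URLs that failed into a list instead of only printing them (files and path_folder are the same names as above):
failed = []
for f in files:
    try:
        wget.download(url=f, out=path_folder)
    except Exception as e:
        print("Could not download file {}: {}".format(f, e))
        failed.append(f)

# 'failed' now holds every URL that raised an exception; you could loop over
# it again for a second attempt, or write it to a log for manual inspection.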

Here is a way to define a timeout; it reads the filename from the URL and retrieves big files as a stream, so your memory won't fill up:
import os
import requests
from urllib.parse import urlparse

timeout = 30  # seconds
for url in urls:
    try:
        # Make the actual request, set the timeout for no data to X seconds
        # and enable streaming responses so we don't have to keep large files in memory
        request = requests.get(url, timeout=timeout, stream=True)
        # Get the filename from the URL
        name = os.path.basename(urlparse(url).path)
        # Open the output file and make sure we write in binary mode
        with open(name, 'wb') as fh:
            # Walk through the response in chunks of 1024 * 1024 bytes, i.e. 1 MiB
            for chunk in request.iter_content(1024 * 1024):
                # Write the chunk to the file
                fh.write(chunk)
    except Exception as e:
        print("Something went wrong:", e)

You can use urllib:
import urllib.request
urllib.request.urlretrieve('http://www.example.com/files/file.ext', 'folder/file.ext')
You can put a try/except around the urlretrieve call to catch any errors. Note that only urllib.error.HTTPError carries a .code attribute, so catch that rather than a bare Exception:
import urllib.error
import urllib.request
try:
    urllib.request.urlretrieve('http://www.example.com/files/file.ext', 'folder/file.ext')
except urllib.error.HTTPError as e:
    print("The server couldn't fulfill the request.")
    print('Error code:', e.code)

Adding it as another answer: if you want to solve the timeout problem, you can use the requests library.
import requests
try:
    requests.get('http://url/to/file', timeout=30)  # seconds
except requests.exceptions.RequestException as e:
    print('Request failed:', e)
If you don't specify a timeout, the request will never time out.
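Putting the pieces from the answers above together, here is a minimal sketch with a per-URL try/except, a timeout, streaming and a simple retry loop; urls, path_folder and the retry count are assumptions you would adapt to your case:
import os
import requests
from urllib.parse import urlparse

def download_all(urls, path_folder, retries=3, timeout=30):
    for url in urls:
        name = os.path.basename(urlparse(url).path)
        dest = os.path.join(path_folder, name)
        for attempt in range(1, retries + 1):
            try:
                with requests.get(url, timeout=timeout, stream=True) as r:
                    r.raise_for_status()  # treat 4xx/5xx responses as errors
                    with open(dest, 'wb') as fh:
                        for chunk in r.iter_content(1024 * 1024):
                            fh.write(chunk)
                break  # success, move on to the next URL
            except requests.exceptions.RequestException as e:
                print("Attempt {} failed for {}: {}".format(attempt, url, e))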

Related

Stop pulling data from an API after a specific amount of time

I am pulling some data from a streaming API using Python 3, and I need to stop pulling that data after 60 seconds. Also, if anyone has suggestions on chunk_size or some alternative for streaming, I'd be open to that.
So far this is what I have:
response = requests.get('link to site', stream=True)
for data in response.iter_content(chunk_size=100):
    print(data)
More speculation than an answer, but you could set a timer and then close the response. This may do it, but I don't have a good way to test it. I don't know which exception to expect when the response is closed, so I catch them all and print them so the code can be adjusted.
import threading
import requests

response = requests.get('link to site', stream=True)
timer = threading.Timer(60, response.close)
try:
    timer.start()
    for data in response.iter_content(chunk_size=100):
        print(data)
except Exception as e:
    print("you want to catch this", e)
finally:
    timer.cancel()
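An alternative sketch without threads: simply check the elapsed time inside the loop and break once 60 seconds have passed (the URL is a placeholder, as in the question). Note that this only reacts between chunks, so it will not interrupt a stalled connection the way the timer-based close does:
import time
import requests

deadline = time.monotonic() + 60  # stop after 60 seconds
response = requests.get('link to site', stream=True)
try:
    for data in response.iter_content(chunk_size=100):
        print(data)
        if time.monotonic() >= deadline:
            break  # only checked after each chunk arrives
finally:
    response.close()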

How to import python requests using a file in order to output the status code

I'm new to Python 3, so please don't bash me :D
I'm using the following code to import a .txt file that contains different URLs, so I can check their status codes.
In my example I added five site URLs; here is import.txt:
https://site1.site
https://site2.site
https://site3.site
https://site4.site
https://site5.site
and this is the Python script itself:
import requests
with open('import.txt', 'r') as f :
    for line in f :
        print(line)
#try :
r = requests.get(line)
print(r.status_code)
#except requests.ConnectionError :
#    print("failed to connect")
this is the response:
https://site1.site
https://site2.site
https://site3.site
https://site4.site
https://site5.site
400
Even though site3 and site4 are 301s and site5 fails to connect, I only receive a single 400 response, which seems to apply to all of the submitted URLs.
If I request.head each one of those URLs with the following script, then I receive the correct page status code ('Moved Permanently' for the example below). This is the single-request script:
import requests
try:
    r = requests.head("http://site3.net/")
    if r.status_code == 200:
        print('Success!')
    elif r.status_code == 301:
        print('Moved Permanently')
    elif r.status_code == 404:
        print('Not Found')
    # print(r.status_code)
except requests.ConnectionError:
    print("failed to connect")
kudos to What’s the best way to get an HTTP response code from a URL?
Your call to requests.get() is outside the for loop, and so is only executed once. Try indenting the relevant lines, like so:
import requests
with open('import.txt', 'r') as f :
    for line in f :
        print(line)
        #try :
        r = requests.get(line)
        print(r.status_code)
        #except requests.ConnectionError :
        #    print("failed to connect")
PS: I suggest you use 4-space indents. That way, errors like this are easier to spot.
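If you also want the descriptive messages from your single-request script, a sketch along these lines should work inside the loop; the .strip() is an assumption to drop the trailing newline that each line read from the file carries:
import requests

with open('import.txt', 'r') as f:
    for line in f:
        url = line.strip()  # drop the trailing newline
        if not url:
            continue  # skip empty lines
        try:
            r = requests.head(url, allow_redirects=False)
            print(url, r.status_code)
        except requests.ConnectionError:
            print(url, "failed to connect")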

Python Error Handling when using requests

I wrote the script below to be able to connect to a remote server and get some data from the XML file. I added some error handling to be able to skip issues with some devices. For some reason whenever the script gets a 401 message back, it breaks the whole loop and I get the message "Could not properly read the csv file". I tried other ways of handling the exception and it would fail at other points. Any info on how to properly deal with this?
#!/usr/bin/python

import sys, re, csv, xmltodict
import requests, logging
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def version(ip, username, password):
    baseUrl = "https://" + ip
    session = requests.Session()
    session.verify = False
    session.timeout = 45
    print "Connecting to " + ip
    try:
        r = session.get(baseUrl + '/getxml?location=/Status', auth=(username, password))
        r.raise_for_status()
    except Exception as error:
        print error
    doc = xmltodict.parse(r.text)
    version = str(doc['Status']['#version'])

def main():
    try:
        with open('list.csv', 'r') as file:
            reader = csv.DictReader(file)
            for row in reader:
                version(row['ip'], row['Username'], row['Password'])
    except Exception as error:
        print ValueError("Could not properly read the csv file \r")
        sys.exit(0)

if __name__ == "__main__":
    main()
The doc and version assignments in version() are outside the try/except, so when the request fails with an exception, the next two operations also fail, raising an uncaught exception that surfaces in main. Can you try including doc and version within the try/except and see if it works?
A related suggestion: catch specific exceptions, as this helps you understand why your code crashed. For example, Response.raise_for_status() raises requests.exceptions.HTTPError; catch that and let other exceptions propagate. The XML parsing might raise something else; catch that too, instead of catching everything.
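A sketch of what that restructuring could look like: the parsing moves inside the try block and the function returns early when the request fails. The return value and the per-request timeout are assumptions (requests ignores a session.timeout attribute, so the timeout has to be passed to get()); you would add a similar except clause for whatever the XML parser raises on malformed input:
def version(ip, username, password):
    baseUrl = "https://" + ip
    session = requests.Session()
    session.verify = False
    print "Connecting to " + ip
    try:
        # pass the timeout per request; a session.timeout attribute is not used by requests
        r = session.get(baseUrl + '/getxml?location=/Status',
                        auth=(username, password), timeout=45)
        r.raise_for_status()
        doc = xmltodict.parse(r.text)
        return str(doc['Status']['#version'])
    except requests.exceptions.HTTPError as error:
        print "HTTP error for " + ip + ": " + str(error)
    except requests.exceptions.RequestException as error:
        print "Could not connect to " + ip + ": " + str(error)
    return None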

Generate http error from python3 requests

I have a simple long poll thing using python3 and the requests package. It currently looks something like:
def longpoll():
    session = requests.Session()
    while True:
        try:
            fetched = session.get(MyURL)
            input = base64.b64decode(fetched.content)
            output = process(input)
            session.put(MyURL, data=base64.b64encode(output))
        except Exception as e:
            print(e)
            time.sleep(10)
There is a case where, instead of processing the input and putting the result, I'd like to raise an HTTP error. Is there a simple way to do this from the high-level Session interface? Or do I have to drill down and use the lower-level objects?
Since you have control over the server, you may want to reverse the second call.
Here is an example using Bottle to receive the second poll:
import time

import bottle
import requests

def longpoll():
    session = requests.Session()
    while True:  # I'm guessing that the server does not care that we call it a lot of times ...
        try:
            session.post(MyURL, {"ip_address": my_ip_address})  # request work or report "I'm alive"
            #input = base64.b64decode(fetched.content)
            #output = process(data)
            #session.put(MyURL, data=base64.b64encode(response))
        except Exception as e:
            print(e)
            time.sleep(10)

@bottle.post("/process")
def process_new_work():
    data = bottle.request.json  # in Bottle, request.json is a property, not a method
    output = process(data)  # if an error is thrown, an HTTP error will be returned by the framework
    return output
This way the server will get either the output or a bad HTTP status.
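If you want to return a specific HTTP error explicitly rather than relying on an uncaught exception (which Bottle turns into a 500), bottle.abort lets you choose the status code; a small sketch, where the ValueError is a placeholder for whatever process() raises on bad input:
@bottle.post("/process")
def process_new_work():
    data = bottle.request.json
    if data is None:
        # no JSON body: reply with 400 Bad Request
        bottle.abort(400, "expected a JSON body")
    try:
        return process(data)
    except ValueError as e:
        # signal a client-side problem with an explicit status code
        bottle.abort(422, str(e))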

Urllib2 Python - Reconnecting and Splitting Response

I am moving to Python from another language and I am not sure how to properly tackle this. Using the urllib2 library it is quite easy to set up a proxy and get data from a site:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
The problem I have is that the text file that is retrieved is very large (hundreds of MB) and the connection is often problematic. The code also needs to catch connection, server and transfer errors (it will be part of a small, extensively used pipeline).
Could anyone suggest how to modify the code above to make sure the code automatically reconnects n times (for example 100 times) and perhaps split the response into chunks so the data will be downloaded faster and more reliably?
I have already split the requests as much as I could, so now I have to make sure that the retrieval code is as good as it can be. Solutions based on core Python libraries are ideal.
Perhaps the library is already doing the above, in which case: is there any way to improve the downloading of large files? I am using UNIX and need to deal with a proxy.
Thanks for your help.
I'm putting up an example of how you might want to do this with the python-requests library. The script below checks whether the destination file already exists. If it does, it's assumed to be a partially downloaded file, and the script tries to resume the download. If the server claims to support HTTP partial requests (i.e. the response to a HEAD request contains an Accept-Ranges header), the script resumes based on the size of the partially downloaded file; otherwise it just does a regular download and discards the parts that are already downloaded. I think it should be fairly straightforward to convert this to use just urllib2 if you don't want to use python-requests; it'll probably just be much more verbose.
Note that resuming a download may corrupt the file if the file on the server was modified between the initial download and the resume. This can be detected if the server supports the strong HTTP ETag header, so the downloader can check whether it's resuming the same file.
I make no claim that it is bug-free.
You should probably add checksum logic around this script to detect download errors and retry from scratch if the checksum doesn't match (see the sketch after the script below).
import logging
import os
import re

import requests

CHUNK_SIZE = 5 * 1024  # 5 KiB

logging.basicConfig(level=logging.INFO)

def stream_download(input_iterator, output_stream):
    for chunk in input_iterator:
        output_stream.write(chunk)

def skip(input_iterator, output_stream, bytes_to_skip):
    total_read = 0
    while total_read < bytes_to_skip:
        chunk = next(input_iterator)
        total_read += len(chunk)
        if total_read > bytes_to_skip:
            # keep only the tail of this chunk that lies past the skip boundary
            output_stream.write(chunk[bytes_to_skip - total_read:])
    assert total_read == output_stream.tell()
    return input_iterator

def resume_with_range(url, output_stream):
    dest_size = output_stream.tell()
    headers = {'Range': 'bytes=%s-' % dest_size}
    resp = requests.get(url, stream=True, headers=headers)
    input_iterator = resp.iter_content(CHUNK_SIZE)
    if resp.status_code != requests.codes.partial_content:
        logging.warning('server does not agree to do partial request, skipping instead')
        input_iterator = skip(input_iterator, output_stream, output_stream.tell())
        return input_iterator
    rng_unit, rng_start, rng_end, rng_size = re.match(
        r'(\w+) (\d+)-(\d+)/(\d+|\*)', resp.headers['Content-Range']).groups()
    rng_start, rng_end, rng_size = map(int, [rng_start, rng_end, rng_size])
    assert rng_start <= dest_size
    if rng_start != dest_size:
        logging.warning('server returned different Range than requested')
        output_stream.seek(rng_start)
    return input_iterator

def download(url, dest):
    ''' Download `url` to `dest`, resuming if `dest` already exists

        If `dest` already exists it is assumed to be a partially
        downloaded file for the url.
    '''
    output_stream = open(dest, 'ab+')
    output_stream.seek(0, os.SEEK_END)
    dest_size = output_stream.tell()

    if dest_size == 0:
        logging.info('STARTING download from %s to %s', url, dest)
        resp = requests.get(url, stream=True)
        input_iterator = resp.iter_content(CHUNK_SIZE)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return

    remote_headers = requests.head(url).headers
    remote_size = int(remote_headers['Content-Length'])
    if dest_size < remote_size:
        logging.info('RESUMING download from %s to %s', url, dest)
        support_range = 'bytes' in [s.strip() for s in remote_headers['Accept-Ranges'].split(',')]
        if support_range:
            logging.debug('server supports Range request')
            logging.debug('downloading "Range: bytes=%s-"', dest_size)
            input_iterator = resume_with_range(url, output_stream)
        else:
            logging.debug('skipping %s bytes', dest_size)
            resp = requests.get(url, stream=True)
            input_iterator = resp.iter_content(CHUNK_SIZE)
            input_iterator = skip(input_iterator, output_stream, bytes_to_skip=dest_size)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return

    logging.debug('NOTHING TO DO')
    return

def main():
    TEST_URL = 'http://mirror.internode.on.net/pub/test/1meg.test'
    DEST = TEST_URL.split('/')[-1]
    download(TEST_URL, DEST)

main()
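As suggested above, checksum logic around the downloader catches silent corruption; a minimal sketch assuming you know the expected SHA-256 of the file (expected_sha256 here is a placeholder), reusing the download() function defined above:
import hashlib
import os

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024 * 1024), b''):
            h.update(block)
    return h.hexdigest()

def download_verified(url, dest, expected_sha256, max_attempts=3):
    for attempt in range(max_attempts):
        download(url, dest)  # the download() function defined above
        if sha256_of(dest) == expected_sha256:
            return True
        # checksum mismatch: throw the file away and start from scratch
        os.remove(dest)
    return False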
You can try something like this. It reads the response line by line and appends each line to a file, checking first that it doesn't write the same line twice. I'll write another script that does it by chunks as well.
import urllib2

file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=20)
        print("Connected")
        with open("outfile.html", 'w+') as out_data:
            for data in response.readlines():
                file_checker = open("outfile.html")
                if data not in file_checker.readlines():
                    out_data.write(str(data))
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
file_checker.close()
print("done")
Here's how to read the data in chunks instead of by lines
import urllib2

CHUNK = 16 * 1024
file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=1)
        print("Connected")
        with open("outdata", 'wb+') as out_data:
            while True:
                chunk = response.read(CHUNK)
                file_checker = open("outdata")
                if chunk and chunk not in file_checker.readlines():
                    out_data.write(chunk)
                else:
                    break
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
file_checker.close()
print("done")
