I'm trying to stream a download from an nginx server and upload it at the same time. The download uses the requests streaming implementation and the upload uses chunking; the intention is to report progress while the download/upload is in flight.
Here is what I have so far:
from contextlib import closing
import requests

with closing(requests.get(vmdk_url, stream=True, timeout=60 + 1)) as vmdk_request:
    chunk_in_bytes = 50 * 1024 * 1024
    total_length = int(vmdk_request.headers['Content-Length'])

    def vmdk_streamer():
        # Yield the download chunk by chunk, reporting progress to the lease.
        sent_length = 0
        for data in vmdk_request.iter_content(chunk_in_bytes):
            sent_length += len(data)
            progress_in_percent = (sent_length / (total_length * 1.0)) * 100
            lease.HttpNfcLeaseProgress(int(progress_in_percent))
            yield data

    # Passing a generator as data makes requests send a chunked upload.
    result = requests.post(
        upload_url, data=vmdk_streamer(), verify=False,
        headers={'Content-Type': 'application/x-vnd.vmware-streamVmdk'})
This works fine in some contexts. When I run it in another one (a Cloudify plugin, if you're interested), it fails to read data after around 60 seconds.
So I'm looking for an alternative, or simply better, way of streaming a download into an upload, as my 60-second issue might (I hope) come from how I'm streaming. Preferably with requests, but I'd use anything up to and including raw urllib3.
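One way to make the streaming logic easier to test outside the plugin is to separate the progress reporting from the network calls. The sketch below is not a confirmed fix for the 60-second failure. `stream_with_progress` and `report` are hypothetical names, and the 1 MiB chunk size and `(10, 60)` timeout in the usage comment are assumptions (smaller chunks mean the download socket is read more often than with 50 MiB chunks).

```python
def stream_with_progress(chunks, total_length, report):
    """Yield each chunk unchanged, calling report(percent) as data passes through.

    chunks: any iterable of bytes objects (for example response.iter_content(...)).
    total_length: expected total size in bytes (for example from Content-Length).
    report: callable taking an int percentage (hypothetical callback).
    """
    sent_length = 0
    last_percent = -1
    for data in chunks:
        if not data:
            continue  # skip empty keep-alive chunks
        sent_length += len(data)
        percent = min(100, sent_length * 100 // total_length) if total_length else 0
        if percent != last_percent:
            report(percent)  # only report when the integer percentage changes
            last_percent = percent
        yield data


# Possible wiring with requests (assumed, mirrors the code in the question):
#   with requests.get(vmdk_url, stream=True, timeout=(10, 60)) as r:
#       body = stream_with_progress(r.iter_content(1024 * 1024),
#                                   int(r.headers['Content-Length']),
#                                   lease.HttpNfcLeaseProgress)
#       requests.post(upload_url, data=body, verify=False,
#                     headers={'Content-Type': 'application/x-vnd.vmware-streamVmdk'})
```

Because the generator takes any iterable of bytes, it can be exercised with a plain list before involving either server.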
I'm trying to download a medium-sized APK file (10-300 MB) and save it locally. My connection speed should be about 90 Mbps, yet the download rarely exceeds 1 Mbps, and my network doesn't seem to be anywhere near its cap.
Using cProfile, I've verified that the part that gets stuck is indeed the SSL download, and I've tried various advice from StackOverflow, such as reducing or increasing the chunk size, to no avail. I'd love either a way to test whether this could be a server-side issue, or advice on what I'm doing wrong on the client side.
Relevant code:
import requests

session = requests.Session()  # I've heard a session is better due to the persistent HTTP connection
session.trust_env = False
# url, REQUEST_HEADERS and TIMEOUT (60) are defined elsewhere
r = session.get(url, headers=REQUEST_HEADERS, stream=True, timeout=TIMEOUT)
r.raise_for_status()
filename = 'myFileName'
i = 0
with open(filename, 'wb') as result:
    for chunk in r.iter_content(chunk_size=1024*1024):
        if chunk:
            i += 1
            if i % 5 == 0:
                print(f'At chunk {i} with timeout {TIMEOUT}')
            result.write(chunk)
I was trying to download many different files. After printing the URLs and testing them in Chrome, I saw that some of them were significantly slower to download than others.
It seems to be a server issue, which I ended up working around by picking a suitable timeout for the requests.
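To check whether a slow transfer is tied to particular URLs rather than to the client code, the rate can be measured per URL and compared. A minimal sketch: `measure_throughput` is a hypothetical helper, and the `clock` parameter exists only so the timing can be replaced in tests.

```python
import time


def measure_throughput(chunks, clock=time.monotonic):
    """Consume an iterable of bytes chunks; return (total_bytes, bytes_per_second)."""
    start = clock()
    total = 0
    for chunk in chunks:
        total += len(chunk)
    elapsed = clock() - start
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, rate


# Possible use with the session from the question (assumed wiring):
#   r = session.get(url, headers=REQUEST_HEADERS, stream=True, timeout=TIMEOUT)
#   total, rate = measure_throughput(r.iter_content(chunk_size=1024 * 1024))
#   print(url, total, "bytes at", rate / 1e6, "MB/s")
```

If the same client code reaches very different rates on different URLs, the bottleneck is more likely on the server side.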
I have a set of image URLs with indexes, and I want to pass them to a downloader that can fetch multiple files at a time to speed up the process.
I put the file names and URLs into dicts (name and d2 respectively) and then used requests and threading:
import threading
import requests

def download_range(start, stop):
    # Download d2[start:stop] sequentially within one thread.
    for i in range(start, stop):
        url = d2[i]
        r = requests.get(url)
        with open('assayImage/{}'.format(name[i]), 'wb') as f:
            f.write(r.content)

# One thread per block of 1500 URLs.
for n in range(0, len(d2), 1500):
    stop = min(n + 1500, len(d2))
    threading.Thread(target=download_range, args=(n, stop)).start()
However, sometimes a connection times out and that file is not downloaded, and after a while the download speed drops dramatically. For example, in the first hour I can download 10000 files, but 3 hours later I can only download 8000 files. Each file is small, around 500 KB.
So, is there a stable way to download a large number of files? I'd really appreciate your answer.
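In the code above each thread works through its 1500 URLs one after another, and `requests.get` is called without a timeout, so a single stalled connection can hold up a whole block. One common alternative is a bounded thread pool with a per-request timeout and a few retries. The sketch below is only an illustration under those assumptions: `fetch_bytes`, `download_one` and `download_all` are hypothetical names, the worker count and retry settings are arbitrary, and the default fetcher uses the standard library so the example is self-contained (a `requests.Session().get(url, timeout=...)` call could be swapped in).

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_bytes(url, timeout=30):
    """Default fetcher: return the body of url, failing after timeout seconds of inactivity."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_one(url, path, fetch=fetch_bytes, retries=3, backoff=1.0):
    """Fetch url and write it to path, retrying on any exception."""
    for attempt in range(1, retries + 1):
        try:
            data = fetch(url)
            with open(path, "wb") as f:
                f.write(data)
            return path
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff between attempts


def download_all(jobs, fetch=fetch_bytes, max_workers=16, retries=3, backoff=1.0):
    """jobs: iterable of (url, path). Returns (succeeded_paths, failed) with failed as url -> error."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, url, path, fetch, retries, backoff): url
                   for url, path in jobs}
        for future in as_completed(futures):
            url = futures[future]
            try:
                succeeded.append(future.result())
            except Exception as exc:
                failed[url] = exc  # keep the error so the URL can be retried later
    return succeeded, failed


# Possible use with the dicts from the question (assumed wiring):
#   jobs = [(d2[i], 'assayImage/{}'.format(name[i])) for i in range(len(d2))]
#   done, failed = download_all(jobs)
```

Returning the failed URLs instead of silently dropping them makes it possible to run a second pass over only the files that are still missing.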
Summary:
I have an issue where the google-drive-sdk for Python sometimes does not detect the end of the document being exported. It seems to think that the Google document is of infinite size.
Background, source code and tutorials I followed:
I am working on my own Python-based google-drive backup script (one with a nice CLI for browsing around). The source code is on git.
It's still in the making and currently only finds new files and downloads them (with the 'pull' command).
For the most important google-drive commands, I followed the official Google Drive API tutorial for downloading media.
What works:
When a file is not a Google Docs document, it is downloaded properly. However, when I try to "export" a file, I need to use a different mimeType. I have a dictionary for this.
For example: I map application/vnd.google-apps.document to application/vnd.openxmlformats-officedocument.wordprocessingml.document when exporting a document.
When downloading Google documents from Google Drive, this seems to work fine. By this I mean: my while loop with the code status, done = downloader.next_chunk() will eventually set done to true and the download completes.
What does not work:
However, on some files the done flag never becomes true and the script downloads forever. This eventually amounts to several GB. Perhaps I am looking at the wrong flag for detecting that the file is complete when doing an export. I am surprised that google-drive never throws an error. Does anybody know what could cause this?
Current status
For now I have exporting of google documents disabled in my code.
Scripts like "drive by rakyll" (at least the version I have) just put a link to the online copy. I would really like to do a proper export so that my offline system can maintain a complete backup of everything on Drive.
P.S. It's fine to answer "you should use this service instead of the API" for the sake of others finding this page. I know that there are other services out there for this, but I'm really looking to explore the Drive API functions for integration with my own other systems.
OK. I found a pseudo solution.
The problem is that the Google API never returns the Content-Length and the response is sent in chunks. However, either the chunk returned is wrong, or the Python API is not able to process it correctly.
What I did was grab the code for MediaIoBaseDownload from the google-api-python-client source.
I left everything the same, but changed this part:
if 'content-range' in resp:
    content_range = resp['content-range']
    length = content_range.rsplit('/', 1)[1]
    self._total_size = int(length)
elif 'content-length' in resp:
    self._total_size = int(resp['content-length'])
else:
    # PSEUDO BUG FIX: No content-length, no chunk info, cut the response here.
    self._total_size = self._progress
The else at the end is what I've added. I've also changed the default chunk size by setting DEFAULT_CHUNK_SIZE = 2*1024*1024. You will also have to copy a few imports from that file, including this one: from googleapiclient.http import _retry_request, _should_retry_response
Of course this is not a real solution; it just says "if I don't understand the response, stop here". This will probably make some exports fail, but at least it doesn't kill the server. It is only a stopgap until we can find a good solution.
UPDATE:
The bug is already reported here: https://github.com/google/google-api-python-client/issues/15
and as of January 2017, the only workaround is to not use MediaIoBaseDownload and do this instead (not suitable for large files):
req = service.files().export(fileId=file_id, mimeType=mimeType)
resp = req.execute(http=http)
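If MediaIoBaseDownload is still needed for files too large to fetch with a single execute() call, one defensive option is to cap how far the loop may run. A minimal sketch, assuming the status object exposes resumable_progress (bytes received so far), as MediaDownloadProgress does. `download_with_guard` and `max_bytes` are hypothetical names, not part of the Google client library, and this only stops a runaway export rather than fixing it.

```python
def download_with_guard(downloader, max_bytes):
    """Drive a MediaIoBaseDownload-style object, aborting if it runs past max_bytes.

    downloader: any object whose next_chunk() returns (status, done), where
    status.resumable_progress is the number of bytes received so far.
    max_bytes: safety cap chosen by the caller (hypothetical parameter).
    """
    done = False
    while not done:
        status, done = downloader.next_chunk()
        if not done and status.resumable_progress > max_bytes:
            raise RuntimeError(
                "export exceeded %d bytes without finishing" % max_bytes)
    return status.resumable_progress


# Possible use (assumed wiring):
#   downloader = MediaIoBaseDownload(fh, request)
#   download_with_guard(downloader, max_bytes=200 * 1024 * 1024)
```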
I'm using this and it works with the following libraries:
google-auth-oauthlib==0.4.1
google-api-python-client
google-auth-httplib2
This is the snippet I'm using:
import io

from apiclient import errors
from googleapiclient.http import MediaIoBaseDownload
from googleapiclient.discovery import build

def download_google_document_from_drive(self, file_id):
    try:
        request = self.service.files().get_media(fileId=file_id)
        fh = io.BytesIO()
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print('Download %d%%.' % int(status.progress() * 100))
        return fh
    except Exception as e:
        print('Error downloading file from Google Drive: %s' % e)
You can then read the downloaded stream directly, for example as an Excel workbook:
import xlrd
workbook = xlrd.open_workbook(file_contents=fh.getvalue())
My website has a feature that exports a daily report to Excel; the report varies by user. For some reason I can't consider Redis or Memcached. For each user there are around 200,000-500,000 (2-5 lakh) rows in the database. When a user calls the export-to-Excel feature, it takes 5-10 minutes, and during that time all of the server's resources (RAM, CPU) are used to build the Excel file, so the site is down for about 5 minutes; after that everything works fine again. I also split the query result into small chunks to address the RAM issue, which solves about 50 percent of my problem. Is there any other solution for CPU and RAM optimization?
Sample code:
import xlwt
from django.http import HttpResponse

def import_to_excel(request):
    order_list = Name.objects.all()
    book = xlwt.Workbook(encoding='utf8')
    style = xlwt.Style.default_style
    fname = 'order_data'
    sheet = book.add_sheet(fname)
    for row, order in enumerate(order_list):
        sheet.write(row, 1, order.first_name, style=style)
        sheet.write(row, 2, order.last_name, style=style)
    # content_type replaces the mimetype argument removed in Django 1.7
    response = HttpResponse(content_type='application/vnd.ms-excel')
    response['Content-Disposition'] = 'attachment; filename=order_data_pull.xls'
    book.save(response)
    return response
Instead of an HttpResponse, use StreamingHttpResponse.
By streaming a file that takes a long time to generate, you can avoid a load balancer dropping a connection that might otherwise have timed out while the server was generating the response.
You can also process the request asynchronously using Celery.
Processing requests asynchronously allows your server to accept other requests while the previous one is being processed by a worker in the background.
That will make your system more user friendly.
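As a rough illustration of the streaming approach: the sketch below streams CSV rather than .xls, because xlwt only writes the workbook out on book.save(), so this is a swapped-in format rather than a drop-in replacement. It follows the pseudo-buffer pattern from the Django documentation; `iter_csv_rows` is a hypothetical helper name, and the view wiring in the comments assumes the Name model from the question.

```python
import csv


class Echo:
    """Pseudo-buffer: write() returns the value instead of storing it."""

    def write(self, value):
        return value


def iter_csv_rows(rows):
    """Yield one CSV-formatted line per row without building the whole file in memory."""
    writer = csv.writer(Echo())
    for row in rows:
        # writerow() returns whatever Echo.write() returned: the formatted line.
        yield writer.writerow(row)


# Possible use in a Django view (assumed wiring):
#   from django.http import StreamingHttpResponse
#   rows = Name.objects.values_list('first_name', 'last_name').iterator()
#   response = StreamingHttpResponse(iter_csv_rows(rows), content_type='text/csv')
#   response['Content-Disposition'] = 'attachment; filename=order_data_pull.csv'
#   return response
```

Because rows are formatted and sent one at a time, memory use stays roughly constant regardless of how many rows the report contains.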
I have a task to download gigabytes of data from a website. The data is in the form of .gz files, each 45mb in size.
The easy way to get the files is to use "wget -r -np -A files url". This downloads the data recursively and mirrors the website. The download rate is very high: 4mb/sec.
But, just to play around, I was also using Python to build my own URL parser.
Downloading via Python's urlretrieve is damn slow, possibly 4 times as slow as wget. The download rate is 500kb/sec. I use HTMLParser for parsing the href tags.
I am not sure why this is happening. Are there any settings for this?
Thanks
Probably a unit math error on your part.
Just noticing that 500KB/s (kilobytes) is equal to 4Mb/s (megabits).
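The conversion can be checked with a couple of lines, assuming decimal (SI) prefixes, i.e. 1 KB = 1000 bytes and 1 Mb = 1,000,000 bits. `kilobytes_to_megabits` is a hypothetical helper name.

```python
BITS_PER_BYTE = 8


def kilobytes_to_megabits(kilobytes_per_second):
    """Convert a rate in KB/s to Mb/s using decimal (SI) prefixes."""
    return kilobytes_per_second * BITS_PER_BYTE / 1000


print(kilobytes_to_megabits(500))  # 4.0
```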
urllib works for me as fast as wget. Try this code; it shows the progress as a percentage, just as wget does.
import sys, urllib

def reporthook(a, b, c):
    # a: blocks transferred so far, b: block size, c: total file size
    # ',' at the end of the line is important!
    print "% 3.1f%% of %d bytes\r" % (min(100, float(a * b) / c * 100), c),
    # you can also use sys.stdout.write
    # sys.stdout.write("\r% 3.1f%% of %d bytes"
    #                  % (min(100, float(a * b) / c * 100), c))
    sys.stdout.flush()

for url in sys.argv[1:]:
    i = url.rfind('/')
    file = url[i+1:]
    print url, "->", file
    urllib.urlretrieve(url, file, reporthook)
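The code above is Python 2. A possible Python 3 equivalent is sketched below, assuming urllib.request.urlretrieve as the replacement for urllib.urlretrieve. `format_progress` is a hypothetical helper split out so the formatting can be checked without a network connection.

```python
import sys
import urllib.request


def format_progress(block_count, block_size, total_size):
    """Return a progress string such as ' 50.0% of 200 bytes'."""
    if total_size <= 0:
        # urlretrieve passes -1 when the server does not report a size.
        return "%d bytes so far" % (block_count * block_size)
    percent = min(100.0, block_count * block_size * 100.0 / total_size)
    return "%5.1f%% of %d bytes" % (percent, total_size)


def reporthook(block_count, block_size, total_size):
    """Progress callback with the signature urlretrieve expects."""
    sys.stdout.write("\r" + format_progress(block_count, block_size, total_size))
    sys.stdout.flush()


def download(url):
    """Download url into the current directory, named after its last path segment."""
    filename = url[url.rfind("/") + 1:]
    print(url, "->", filename)
    urllib.request.urlretrieve(url, filename, reporthook)
    print()


# Possible use:
#   for url in sys.argv[1:]:
#       download(url)
```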
import subprocess
myurl = 'http://some_server/data/'
subprocess.call(["wget", "-r", "-np", "-A", "files", myurl])
As for the HTML parsing, the fastest/easiest option is probably lxml.
As for the HTTP requests themselves: httplib2 is very easy to use, and could possibly speed up downloads because it supports HTTP 1.1 keep-alive connections and gzip compression. There is also PycURL, which claims to be very fast (but is more difficult to use) and is built on libcurl, but I've never used it.
You could also try to download different files concurrently, but also keep in mind that trying to optimize your download times too far may be not very polite towards the website in question.
Sorry for the lack of hyperlinks, but SO tells me "sorry, new users can only post a maximum of one hyperlink"
Transfer speeds can easily be misleading. Could you try the following script, which simply downloads the same URL with both wget and urllib.urlretrieve? Run it a few times in case you're behind a proxy which caches the URL on the second attempt.
For small files, wget will take slightly longer due to the external process's startup time, but for larger files that should become irrelevant.
from time import time
import urllib
import subprocess
target = "http://example.com" # change this to a more useful URL
wget_start = time()
proc = subprocess.Popen(["wget", target])
proc.communicate()
wget_end = time()
url_start = time()
urllib.urlretrieve(target)
url_end = time()
print "wget -> %s" % (wget_end - wget_start)
print "urllib.urlretrieve -> %s" % (url_end - url_start)
Maybe you can wget and then inspect the data in Python?
Since Python suggests using urllib2 instead of urllib, I ran a test comparing urllib2.urlopen and wget.
The result is that both take nearly the same time to download the same file. Sometimes urllib2 even performs better.
The advantage of wget lies in its dynamic progress bar, which shows the percentage finished and the current download speed while transferring.
The file in my test was 5 MB. I didn't use any cache module in Python, and I am not aware of how wget behaves when downloading large files.
There shouldn't be a difference really. All urlretrieve does is make a simple HTTP GET request. Have you taken out your data processing code and done a straight throughput comparison of wget vs. pure python?
Please show us some code. I'm pretty sure it has to do with your code and not with urlretrieve.
I've worked with it in the past and never had any speed related issues.
You can use wget -k to convert the links in the downloaded documents so that they work for local viewing.