I'm trying to download a medium-sized APK file (10-300 MB) and save it locally. My connection speed should be about 90 Mbps, yet the download rarely surpasses 1 Mbps, and my network doesn't seem to be anywhere near its cap.
I've verified with cProfile that the part that's getting stuck is indeed the SSL download, and I've tried various advice from StackOverflow, like reducing or increasing the chunk size, to no avail. I'd love a way to either test whether this could be a server-side issue, or advice on what I'm doing wrong on the client side.
Relevant code:
session = requests.Session()  # I've heard a Session is better due to the persistent HTTP connection
session.trust_env = False
r = session.get(<url>, headers=REQUEST_HEADERS, stream=True, timeout=TIMEOUT)  # timeout=60
r.raise_for_status()
filename = 'myFileName'
i = 0
with open(filename, 'wb') as result:
    for chunk in r.iter_content(chunk_size=1024*1024):
        if chunk:
            i += 1
            if i % 5 == 0:
                print(f'At chunk {i} with timeout {TIMEOUT}')
            result.write(chunk)
I was trying to download many different files; upon printing the URLs and testing them in Chrome, I saw that some of them were significantly slower to download than others.
It seems like a server issue, which I ended up working around by setting a suitable timeout on the requests.
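To check whether a given URL is slow on the server side, a rough approach is to time the raw transfer for a few URLs and compare them against a known-fast reference. The helper below is only a minimal sketch; the URLs are placeholders and the 20 MB sample size is arbitrary.

import time
import requests

def measure_throughput(url, headers=None, chunk_size=1024 * 1024, max_bytes=20 * 1024 * 1024):
    """Download up to max_bytes from url and return the observed speed in Mbit/s."""
    downloaded = 0
    start = time.monotonic()
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=chunk_size):
            downloaded += len(chunk)
            if downloaded >= max_bytes:
                break
    elapsed = time.monotonic() - start
    return downloaded * 8 / elapsed / 1_000_000

# Compare a couple of the problem URLs against a reference file you know downloads fast.
for url in ['<slow-apk-url>', '<reference-url>']:
    print(url, f'{measure_throughput(url):.1f} Mbit/s')

If the slow URLs stay slow regardless of client settings while the reference is fast, the bottleneck is almost certainly on the server's side.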
Related
Goal: I want to download daily data for every day going back several years from a website.
The website has a login, and each page only shows 7 CSV files; you have to click "previous" to view the prior 7, and so on. Ideally I want to download all of these into one folder of daily data.
The download link for the files does follow a very simple format, which I attempted to take advantage of:
https://cranedata.com/publications/download/mfi-daily-data/issue/2020-09-08/
with only the ending changing for each date, disregarding weekends.
I have attempted to modify several versions of code but ultimately have not found anything that works.
#!/usr/bin/env ipython
# --------------------
import requests
import shutil
import datetime
# -----------------------------------------------------------------------------------
dates = [datetime.datetime(2019, 1, 1) + datetime.timedelta(dval) for dval in range(0, 366)]
# -----------------------------------------------------------------------------------
for dateval in dates:
    r = requests.get('https://cranedata.com/publications/download/mfi-daily-data/issue/' + dateval.strftime('%Y-%m-%d'), stream=True)
    if r.status_code == 200:
        with open(dateval.strftime('%Y%m%d') + ".csv", 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
# ---------------------------------------------------------------------------------
This does seem to work for files on other websites, but the CSV files downloaded here do not actually contain data.
This is what my Excel files show instead of the actual data:
https://prnt.sc/ugju49
You need to add authentication information to the request. This is done either via a header or via a cookie. You can use a requests.Session object to simplify this in both cases.
It is not possible to give you more details without knowing which technologies the site uses for authentication.
Chances are (by the looks of the site) that it uses a server-side session. So there should be something like a "session id" or "sid" in your headers when talking to the back-end. You need to open the browser's developer tools and look closely at the request headers, and also at the response and the response headers when you perform a login.
If you are very lucky, just using a requests.Session might be enough, as long as you perform a login at the beginning of the session. Something like this:
#!/usr/bin/env ipython
import requests
import shutil
import datetime

dates = [datetime.datetime(2019, 1, 1) + datetime.timedelta(dval) for dval in range(0, 366)]

with requests.Session() as sess:
    # Placeholder: post the site's actual login URL and credentials here.
    sess.post(login_url, data=credentials)
    for dateval in dates:
        r = sess.get('https://cranedata.com/publications/download/mfi-daily-data/issue/' + dateval.strftime('%Y-%m-%d'), stream=True)
        if r.status_code == 200:
            with open(dateval.strftime('%Y%m%d') + ".csv", 'wb') as f:
                r.raw.decode_content = True
                shutil.copyfileobj(r.raw, f)
If this does not work, you need to closely inspect the network tab in the developer tools, pick out the interesting bits, and reproduce them in your code.
Which parts are the "interesting bits" depends on the back-end, and without further details I cannot say.
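If it comes to that, reproducing the interesting bits usually looks something like the sketch below: copy the session cookie and any custom headers you see in the developer tools into the Session before requesting the files. The cookie name and header values here are made-up placeholders; use whatever your browser actually sends after you log in.

import requests

sess = requests.Session()

# Placeholders: copy the real values from the browser's developer tools
# (Network tab -> request headers) after logging in.
sess.cookies.set('PHPSESSID', '<session-id-from-browser>', domain='cranedata.com')
sess.headers.update({
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://cranedata.com/',
})

r = sess.get('https://cranedata.com/publications/download/mfi-daily-data/issue/2020-09-08')
print(r.status_code, r.headers.get('Content-Type'))  # should be a CSV, not an HTML login page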
To briefly explain the context: I am downloading SEC prospectus data, for example. After downloading, I parse each file to extract certain data, then write the parsed dictionaries to a JSON file containing a list of dictionaries. I would use a SQL database for the output, but the research cluster admins at my university are being slow about getting me access. If anyone has suggestions for how to store the data for easy reading/writing later, I would appreciate it; I was thinking about HDF5 as a possible alternative.
Below is a minimal example of what I am doing, with the spots I think I need to improve labeled.
import json
import os
from math import ceil
from multiprocessing import Pool

import grequests
from tqdm import tqdm

BASE_URL = 'https://www.sec.gov/Archives/'  # same base as in getformurls.sh

def classify_file(doc):
    try:
        data = {
            'link': doc.url
        }
    except AttributeError:
        return {'flag': 'ATTRIBUTE ERROR'}
    # Do a bunch of parsing using regular expressions

if __name__ == "__main__":
    items = list()
    for d in tqdm([y + ' ' + q for y in ['2019'] for q in ['1']]):
        stream = os.popen('bash ./getformurls.sh ' + d)
        stacked = stream.read().strip().split('\n')
        # split each line into the fixed-width fields
        widths = (12, 62, 12, 12, 44)
        items += [[item[sum(widths[:j]):sum(widths[:j + 1])].strip() for j in range(len(widths))] for item in stacked]
    urls = [BASE_URL + item[4] for item in items]
    resp = list()
    # PROBLEM 1
    filelimit = 100
    for i in range(ceil(len(urls) / filelimit)):
        print(f'Downloading: {i * filelimit / len(urls) * 100:2.0f}%... ', end='\r', flush=True)
        resp += [r for r in grequests.map(grequests.get(u) for u in urls[i * filelimit:(i + 1) * filelimit])]
    # PROBLEM 2
    with Pool() as p:
        rs = p.map_async(classify_file, resp, chunksize=20)
        rs.wait()
        prospectus = rs.get()
    with open('prospectus_data.json', 'w') as f:  # 'w' mode was missing; without it json.dump fails
        json.dump(prospectus, f)
The getformurls.sh referenced above is a bash script I wrote; it was faster than doing the same thing in Python since I could use grep. Its code is:
#!/bin/bash
BASE_URL="https://www.sec.gov/Archives/"
INDEX="edgar/full-index/"
url="${BASE_URL}${INDEX}$1/QTR$2/form.idx"
out=$(curl -s ${url} | grep "^485[A|B]POS")
echo "$out"
PROBLEM 1: I am currently pulling about 18k files in the grequests map call. I was running into an error about too many open files, so I decided to split the urls list into manageable chunks. I don't like this solution, but it works.
PROBLEM 2: This is where my actual error is. The code runs fine on a smaller set of URLs (~2k) on my laptop (it uses 100% of my CPU and ~20 GB of RAM: ~10 GB for the file downloads and another ~10 GB once the parsing starts), but when I take it to the larger 18k dataset using 40 cores on a research cluster, it climbs to ~100 GB RAM and ~3 TB swap usage and then crashes, after parsing about 2k documents in 20 minutes, via a KeyboardInterrupt from the server.
I don't really understand why the swap usage is getting so out of hand, but I think I really just need help with memory management here. Is there a way to create a generator of unsent requests that will only be sent when I call classify_file() on them later? Any help would be appreciated.
Generally, when you have runaway memory usage with a Pool, it's because the workers are being re-used and accumulating memory with each iteration. You can occasionally close and re-open the pool to prevent this, but it's such a common issue that Python now has a built-in parameter to do it for you...
Pool's maxtasksperchild argument is the number of tasks a worker process can complete before it exits and is replaced with a fresh worker process, which lets unused resources be freed. The default maxtasksperchild is None, which means worker processes live as long as the pool.
There's no way for me to tell you what the right value is, but you generally want to set it low enough that resources can be freed fairly often, yet not so low that it slows things down (maybe a minute's worth of processing, just as a guess).
with Pool(maxtasksperchild=5) as p:
    rs = p.map_async(classify_file, resp, chunksize=20)
    rs.wait()
    prospectus = rs.get()
For your first problem, you might consider just using requests and moving the download inside the worker process you already have. Pulling 18k URLs and caching all that data up front takes time and memory. If the download is encapsulated in the worker, you'll minimize the data held in memory and you won't need to keep so many file handles open.
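A rough sketch of that idea, assuming the rest of your script stays as in the question (urls and classify_file come from there); download_and_classify is a made-up helper name and this is untested against your data:

import requests
from multiprocessing import Pool

def download_and_classify(url):
    # Each worker fetches and parses one document, so at most
    # (number of workers) responses are in memory at any time.
    try:
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
    except requests.RequestException:
        return {'flag': 'DOWNLOAD ERROR', 'link': url}
    return classify_file(resp)  # classify_file as defined in the question

if __name__ == '__main__':
    # urls built exactly as in the question
    with Pool(maxtasksperchild=5) as p:
        prospectus = p.map(download_and_classify, urls, chunksize=20)

Because each worker only ever holds the response it is currently parsing, peak memory scales with the pool size rather than with the 18k-file batch.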
I'm trying to stream a download from an nginx server and simultaneously upload it. The download uses requests' stream implementation; the upload uses chunking. The intention is to be able to report progress as the down/upload occurs.
The overall code of what I've got so far is like so:
with closing(requests.get(vmdk_url, stream=True, timeout=60 + 1)) as vmdk_request:
    chunk_in_bytes = 50 * 1024 * 1024
    total_length = int(vmdk_request.headers['Content-Length'])

    def vmdk_streamer():
        sent_length = 0
        for data in vmdk_request.iter_content(chunk_in_bytes):
            sent_length += len(data)
            progress_in_percent = (sent_length / (total_length * 1.0)) * 100
            lease.HttpNfcLeaseProgress(int(progress_in_percent))
            yield data

    result = requests.post(
        upload_url, data=vmdk_streamer(), verify=False,
        headers={'Content-Type': 'application/x-vnd.vmware-streamVmdk'})
In a certain set of contexts this works fine. When I put it into another context (a Cloudify plugin, if you're interested), it fails to read data after around 60 seconds.
So I'm looking for an alternative, or simply better, way of streaming a download/upload, as my 60s issue might revolve around how I'm streaming (I hope). Preferably with requests, but really I'd use anything up to and including raw urllib3.
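One untested idea before dropping down to raw urllib3: the GET has a timeout but the POST does not, and with a 50 MB chunk size a single iter_content call can easily take longer than 60 seconds on a slow link. A hedged sketch of the same pattern with a smaller chunk size and explicit (connect, read) timeouts on both sides (vmdk_url, upload_url, and lease as in the code above; the timeout values are guesses):

from contextlib import closing
import requests

CHUNK = 4 * 1024 * 1024  # smaller chunks so each read completes well within the read timeout

with closing(requests.get(vmdk_url, stream=True, timeout=(10, 60))) as vmdk_request:
    total_length = int(vmdk_request.headers['Content-Length'])

    def vmdk_streamer():
        sent_length = 0
        for data in vmdk_request.iter_content(CHUNK):
            sent_length += len(data)
            lease.HttpNfcLeaseProgress(int(sent_length * 100.0 / total_length))
            yield data

    result = requests.post(
        upload_url, data=vmdk_streamer(), verify=False, timeout=(10, 120),
        headers={'Content-Type': 'application/x-vnd.vmware-streamVmdk'})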
Summary:
I have an issue where sometimes the google-drive-sdk for Python does not detect the end of a document being exported. It seems to think the Google document is of infinite size.
Background, source code and tutorials I followed:
I am working on my own Python-based Google Drive backup script (one with a nice CLI interface for browsing around). git link for source code
It's still in the making and currently only finds new files and downloads them (with the 'pull' command).
For the most important Google Drive commands, I followed the official Google Drive API tutorials for downloading media. here
What works:
When a document or file is a non-Google-Docs document, the file is downloaded properly. However, when I try to "export" a file, I see that I need to use a different mimeType. I have a dictionary for this.
For example: I map application/vnd.google-apps.document to application/vnd.openxmlformats-officedocument.wordprocessingml.document when exporting a document.
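The mapping itself is nothing special; a trimmed sketch of what such a dictionary can look like (the exact set of pairs is up to you, these are the standard Google-to-Office export types):

# Google Docs editor types mapped to the Office formats used for export.
EXPORT_MIMETYPES = {
    'application/vnd.google-apps.document':
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'application/vnd.google-apps.spreadsheet':
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'application/vnd.google-apps.presentation':
        'application/vnd.openxmlformats-officedocument.presentationml.presentation',
}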
When downloading Google documents from Google Drive this way, it seems to work fine. By this I mean: my while loop with the code status, done = downloader.next_chunk() will eventually set done to true and the download completes.
What does not work:
However, on some files the done flag never becomes true and the script keeps downloading forever. This eventually amounts to several GB. Perhaps I am looking for the wrong flag that says the file is complete when doing an export. I am surprised that google-drive never throws an error. Does anybody know what could cause this?
Current status
For now I have exporting of Google documents disabled in my code.
Scripts like "drive" by rakyll (at least the version I have) just put a link to the online copy. I would really like to do a proper export so that my offline system can maintain a complete backup of everything on Drive.
P.S. It's fine to suggest "you should use this service instead of the API" for the sake of others finding this page. I know there are other services out there for this, but I'm really looking to explore the Drive API functions for integration with my own other systems.
OK. I found a pseudo-solution here.
The problem is that the Google API never returns the Content-Length and the response is delivered in chunks. However, either the chunk information returned is wrong, or the Python API is not able to process it correctly.
What I did was grab the code for MediaIoBaseDownload from here.
I left everything the same, but changed this part:
if 'content-range' in resp:
    content_range = resp['content-range']
    length = content_range.rsplit('/', 1)[1]
    self._total_size = int(length)
elif 'content-length' in resp:
    self._total_size = int(resp['content-length'])
else:
    # PSEUDO BUG FIX: No content-length, no chunk info, cut the response here.
    self._total_size = self._progress
The else branch at the end is what I've added. I've also changed the default chunk size by setting DEFAULT_CHUNK_SIZE = 2*1024*1024. You will also have to copy a few imports from that file, including this one: from googleapiclient.http import _retry_request, _should_retry_response
Of course this is not a real solution; it just says "if I don't understand the response, stop it here". This will probably make some exports not work, but at least it doesn't kill the server. It's only until we can find a good solution.
UPDATE:
The bug has already been reported here: https://github.com/google/google-api-python-client/issues/15
and as of January 2017 the only workaround is to not use MediaIoBaseDownload and do this instead (not suitable for large files):
req = service.files().export(fileId=file_id, mimeType=mimeType)
resp = req.execute(http=http)
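For completeness, a minimal sketch of that workaround end to end; file_id, mimeType, and the authorized service object are assumed to exist as elsewhere in this thread, and the output filename is just an example. req.execute() returns the exported bytes in one call, which is why this is only suitable for files that fit comfortably in memory.

# Export without MediaIoBaseDownload: the whole document comes back in one response.
req = service.files().export(fileId=file_id, mimeType=mimeType)
content = req.execute()

with open('exported.docx', 'wb') as out:
    out.write(content)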
I'm using the snippet below and it works with the following libraries:
google-auth-oauthlib==0.4.1
google-api-python-client
google-auth-httplib2
This is the snippet I'm using:
import io

from apiclient import errors
from googleapiclient.http import MediaIoBaseDownload
from googleapiclient.discovery import build

def download_google_document_from_drive(self, file_id):
    try:
        request = self.service.files().get_media(fileId=file_id)
        fh = io.BytesIO()
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print('Download %d%%.' % int(status.progress() * 100))
        return fh
    except Exception as e:
        print('Error downloading file from Google Drive: %s' % e)
You can then consume the in-memory stream directly, for example loading it with xlrd:
import xlrd
workbook = xlrd.open_workbook(file_contents=fh.getvalue())
My website has a feature for exporting a daily report to Excel, which varies per user. For certain reasons I can't use Redis or Memcached. For each user the number of rows in the database is around 2-5 lakh (200,000-500,000). When a user triggers the export-to-excel feature, it takes 5-10 minutes, and during that time all of the site's resources (RAM, CPU) are consumed building the Excel file, which takes the site down for about 5 minutes; afterwards everything works fine. I have already chunked the query result into small parts, which solved about half of my RAM problem. Is there any other solution for CPU and RAM optimization?
sample code
def import_to_excel(request):
    order_list = Name.objects.all()
    book = xlwt.Workbook(encoding='utf8')
    default_style = xlwt.Style.default_style
    style = default_style
    fname = 'order_data'
    sheet = book.add_sheet(fname)
    row = -1
    for order in order_list:
        row += 1
        sheet.write(row, 1, order.first_name, style=style)
        sheet.write(row, 2, order.last_name, style=style)
    response = HttpResponse(mimetype='application/vnd.ms-excel')  # on Django >= 1.7 this argument is content_type
    response['Content-Disposition'] = 'attachment; filename=order_data_pull.xls'
    book.save(response)
    return response
Instead of an HttpResponse, use a StreamingHttpResponse.
By streaming a file that takes a long time to generate, you can avoid a load balancer dropping a connection that might otherwise have timed out while the server was generating the response.
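A minimal sketch of the streaming approach. Note that an xlwt workbook has to be fully built before it can be saved, so streaming fits more naturally with a row-by-row format such as CSV; the example below streams the same Name queryset as CSV, with the field names taken from your code.

import csv

from django.http import StreamingHttpResponse

class Echo:
    """A pseudo-buffer whose write() just returns the value, for csv.writer."""
    def write(self, value):
        return value

def export_to_csv(request):
    writer = csv.writer(Echo())
    rows = (
        writer.writerow([order.first_name, order.last_name])
        # iterator() avoids loading the whole queryset into memory at once
        for order in Name.objects.all().iterator()
    )
    response = StreamingHttpResponse(rows, content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename=order_data_pull.csv'
    return response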
You can also process the request asynchronously using Celery.
Processing requests asynchronously allows your server to accept other requests while the previous one is handled by a worker in the background.
Your system becomes much more user-friendly that way.
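A hedged sketch of the Celery variant: build the workbook in a background task and save it somewhere the user can fetch it later. The task name, output path, and use of the Name model are illustrative, not a prescribed layout.

import xlwt
from celery import shared_task

@shared_task
def build_order_export(path='/tmp/order_data_pull.xls'):
    # Runs in a Celery worker, so the web process stays responsive.
    book = xlwt.Workbook(encoding='utf8')
    sheet = book.add_sheet('order_data')
    for row, order in enumerate(Name.objects.all().iterator()):
        sheet.write(row, 1, order.first_name)
        sheet.write(row, 2, order.last_name)
    book.save(path)
    return path

# In the view: build_order_export.delay() and tell the user the file is being prepared.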