I am running into a strange issue downloading files from Google Cloud Storage buckets.
If I run this code on Linux, a 64 KB PDF file takes around 5 minutes to download:
def generate_document(request):
    if not ensure_valid_user(request):
        return redirect('/?result=0')
    try:
        long_name = request.GET['long_name']
        short_name = request.GET['short_name']
        file_data, size = CloudStorageManager.get_file(long_name)
        response = HttpResponse(file_data, content_type='application/octet-stream')
        response['Content-Disposition'] = 'attachment; filename={}'.format(short_name)
        response['Content-Length'] = size
        return response
    except Exception as ex:
        print(ex)
Here is the relevant method from my CloudStorageManager class:
class CloudStorageManager:
    # private key file, used for local testing
    storage_client = storage.Client.from_service_account_json(
        'CloudStorageAPIKey.json')
    bucket = storage_client.get_bucket("my.private.bucket")

    @staticmethod
    def get_file(long_name):
        bucket = CloudStorageManager.bucket
        blob = bucket.blob(long_name)
        file_string = blob.download_as_string()
        return file_string, blob.size
What I am lost on is this: on Linux, if I comment out response['Content-Length'] = size in my generate_document() method, the download happens at normal speed. However, when I go home and get on Windows with that line commented out, the download takes 5 minutes again, and it works with the line included.
Can someone help explain where I am going wrong?
Interestingly enough, I fixed the issue by changing how I assign the Content-Length of my response, from:
response['Content-Length'] = size
to
response['Content-Length'] = len(response.content)
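For what it's worth, I believe the root cause is that bucket.blob() only builds a local blob reference without fetching any metadata, so blob.size is still None when get_file() returns it, and a bogus Content-Length stalls the download. A sketch of an alternative fix that keeps the header based on the object's real size (same CloudStorageManager setup as above; this is just my assumption of how get_file() could look):
@staticmethod
def get_file(long_name):
    bucket = CloudStorageManager.bucket
    # get_blob() fetches the object's metadata, so blob.size is populated
    blob = bucket.get_blob(long_name)
    file_bytes = blob.download_as_string()
    return file_bytes, blob.size  # len(file_bytes) works just as well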
Related
I have a Python function that does a multipart upload to S3, where the chunk size (max_file_size_MB) is set to 1 MB:
def upload_file_to_aws(self, file_path, filename):
    MB = 1024 ** 2
    config = TransferConfig(multipart_threshold=self.max_file_size_MB * MB, max_concurrency=10,
                            multipart_chunksize=self.max_file_size_MB * MB, use_threads=True)
    try:
        self.client.upload_file(file_path, self.bucket_name, filename,
                                ExtraArgs={'ContentType': 'application/json'},
                                Config=config,
                                Callback=ProgressPercentage(file_path))
    except ClientError as e:
        logging.error("Error uploading file to AWS. %s", e)
        return False
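ProgressPercentage isn't shown above; I'm assuming it is essentially the progress callback class from the boto3 upload examples, roughly:
import os
import sys
import threading

class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # called by boto3 from the transfer threads with the bytes sent so far
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write("\r%s  %s / %s  (%.2f%%)" % (
                self._filename, self._seen_so_far, self._size, percentage))
            sys.stdout.flush()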
And I'm using the ETag calculation algorithm from here.
I have the following unit test, using a 1.6 MB file, to make sure the ETag calculation function works:
@mock_s3
def test_calculated_s3_etag_for_multipart_uploads_is_correct(self):
    # Run on a file of 1.6 MB
    conn = boto3.resource('s3')
    conn.create_bucket(Bucket=self.client.bucket_name)
    calculated_etag = calculate_s3_etag(SNAPSHOT_DATA_LARGE_FILE)
    self.client.upload_file_to_aws(SNAPSHOT_DATA_LARGE_FILE, METADATA_ID)
    obj = self.client.get_object(self.client.bucket_name, METADATA_ID)
    uploaded_file_etag = obj['ETag']
    assert calculated_etag == uploaded_file_etag
but it fails with assert '"bd86ed92459190f807e59b60b6089674-2"' == u'"eaa8602b7154cab4e0787756b2fdc53f-1"'
The calculated ETag looks alright, but I'm not sure why the S3 ETag looks like the file was uploaded in only 1 chunk (when it should have been 2).
Any ideas what I might be doing wrong?
EDIT: So all of the above works fine for any size file as long as the chunk size is more than 5 MB. If I want to transfer files in chunks smaller than 5 MB, I'm not sure what encoding technique is used (I can't use the plain decoded MD5 or the reverse-engineered ETag decoding).
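For reference, the ETag algorithm that is usually passed around (and which I assume calculate_s3_etag follows) MD5s each part, then MD5s the concatenation of those digests and appends the part count. Two caveats: the part size you feed it has to match the part size the upload actually used, and, if I remember boto3's transfer layer correctly, it silently bumps a multipart_chunksize below 5 MB up to S3's 5 MB minimum part size, which would turn a 1.6 MB file into a single-part upload and explain the "-1" ETag. A sketch of the algorithm:
import hashlib

def calculate_s3_etag(file_path, part_size=8 * 1024 * 1024):
    # MD5 each part, then MD5 the concatenated digests and append the part count
    md5s = []
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(part_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))
    if len(md5s) == 1:
        # file fits in one part: non-multipart uploads get a plain MD5 ETag
        return '"{}"'.format(md5s[0].hexdigest())
    digest_of_digests = hashlib.md5(b''.join(m.digest() for m in md5s))
    return '"{}-{}"'.format(digest_of_digests.hexdigest(), len(md5s))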
I am trying to do a resumable upload to google drive using the v3 api. I want to be able to display a status bar of the upload. I can get it to upload easily and quickly if I do not need a status bar because I can use the .execute() function and it uploads. The problems arise when I want to upload the files in chunks. I've seen a few solutions to this on here and other places, but they don't seem to work.
This is my code for uploading:
CHUNK_SIZE = 256*1024
file_metadata = {'name': file_name, 'parents': [folder_id]} #Metadata for the file we are going to upload
media = MediaFileUpload(file_path, mimetype='application/zip',chunksize=CHUNK_SIZE, resumable=True)
file = service.files().create(body=file_metadata, media_body=media, fields='id')
progress = progressBarUpload(file) # create an instance of the progress bar class
progress.exec_() #execute it
progress.hide() #hide it off the screen after
print(file_name + " uploaded successfully")
return 1 #returns 1 if it was successful
The progress bar starts a thread for my GUI, which then uses the next_chunk() function; that code is here:
signal = pyqtSignal(int)

def __init__(self, file):
    super(ThreadUpload, self).__init__()
    self.file = file

def run(self):
    done = False
    while done == False:
        status, done = self.file.next_chunk()
        print("status->", status)
        if status:
            value = int(status.progress() * 100)
            print("Uploaded", value, "%")
            self.signal.emit(value)
The problem I am getting is that my status = None.
If I use the code below it works correctly, but I cannot view the status of the upload using this method. There is a .execute() call added which makes it work, and I get rid of the next_chunk() part when doing it this way:
CHUNK_SIZE = 256*1024
file_metadata = {'name': file_name, 'parents': [folder_id]} #Metadata for the file we are going to upload
media = MediaFileUpload(file_path, mimetype='application/zip',chunksize=CHUNK_SIZE, resumable=True)
file = service.files().create(body=file_metadata, media_body=media, fields='id').execute()
The first method doesn't work whether I use it in the progress bar thread or not; the second method works both ways every time. I use the progress bar to view the status for downloads and a few other things and it works very well, so I'm pretty confident the problem is that my status is None during this upload.
Thanks for the help in advance.
The problem is that you're comparing the response from the request (the done variable in your case) with False; that condition will never be True again after the first chunk, because the response is either None or an object with the file ID once the upload process has finished. This is the code I tested and it worked successfully:
CHUNK_SIZE = 256 * 1024
file_metadata = {'name': "Test resumable"}  # Metadata for the file we are going to upload
media = MediaFileUpload("test-image.jpg", mimetype='image/jpeg', chunksize=CHUNK_SIZE, resumable=True)
request = service.files().create(body=file_metadata, media_body=media, fields='id')
response = None
while response is None:
    status, response = request.next_chunk()
    if status:
        print(status.progress())
print(response)
print("uploaded successfully")
For your case, you could change these 2 lines:
done = False
while done == False:
to this:
done = None
while done is None:
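Applied to the thread class from your question, the loop might look like this (just a sketch reusing your ThreadUpload wiring; note that the final chunk reports through the response rather than status, so the 100% signal is emitted explicitly at the end):
def run(self):
    response = None
    while response is None:
        status, response = self.file.next_chunk()
        if status:
            value = int(status.progress() * 100)
            print("Uploaded", value, "%")
            self.signal.emit(value)
    self.signal.emit(100)  # the last chunk completes via `response`, not `status`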
I created a function that works for me; it uploads a file to Google Cloud Storage.
The problem is that when my friend tries to upload the same file to the same bucket, using the same code, from his local machine, he gets a timeout error. His internet is very good and he should be able to upload the file with no problems on his connection.
Any idea why is this happening?
def upload_to_cloud(file_path):
    """
    saves a file in the google storage. As Google requires audio files greater than 60 seconds to be saved on cloud before processing
    It always saves in 'audio-files-bucket' (folder)
    Input:
        Path of file to be saved
    Output:
        URI of the saved file
    """
    print("Uploading to cloud...")
    client = storage.Client().from_service_account_json(KEY_PATH)
    bucket = client.get_bucket('audio-files-bucket')
    file_name = str(file_path).split('\\')[-1]
    print(file_name)
    blob = bucket.blob(file_name)
    f = open(file_path, 'rb')
    blob.upload_from_file(f)
    f.close()
    print("uploaded at: ", "gs://audio-files-bucket/{}".format(file_name))
    return "gs://audio-files-bucket/{}".format(file_name)
It throws the timeout exception at the upload_from_file(f) line.
We tried using the upload_from_filename function, but the same error still occurs.
Now you can add a custom timeout to the blob upload since this PR was merged; the default is 60 seconds. So, if your internet connection is slow, you can set a longer timeout for the upload operation.
def upload_blob(bucket_name, source_file_name, destination_blob_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name, timeout=300)
    print(
        "File {} uploaded to {}.".format(
            source_file_name, destination_blob_name
        )
    )
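With the bucket and variables from the question, the call would be something like upload_blob('audio-files-bucket', file_path, file_name); the 300-second timeout is just an example value, so pick whatever fits the connection.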
The problem was solved by reducing the chunk size of the blob. The code changed to:
def upload_to_cloud(file_path):
    """
    saves a file in the google storage. As Google requires audio files greater than 60 seconds to be saved on cloud before processing
    It always saves in 'audio-files-bucket' (folder)
    Input:
        Path of file to be saved
    Output:
        URI of the saved file
    """
    print("Uploading to cloud...")
    client = storage.Client().from_service_account_json(KEY_PATH)
    bucket = client.get_bucket('audio-files-bucket')
    file_name = str(file_path).split('\\')[-1]
    print(file_name)
    blob = bucket.blob(file_name)
    ## For slow upload speed
    storage.blob._DEFAULT_CHUNKSIZE = 2097152  # 1024 * 1024 B * 2 = 2 MB
    storage.blob._MAX_MULTIPART_SIZE = 2097152  # 2 MB
    with open(file_path, 'rb') as f:
        blob.upload_from_file(f)
    print("uploaded at: ", "gs://audio-files-bucket/{}".format(file_name))
    return "gs://audio-files-bucket/{}".format(file_name)
I am writing a function to upload a file to Google Drive using the Python API client. It works for files up to 1 MB but does not work for a 10-MB file. When I try to upload a 10-MB file, I get an HTTP 400 error. Any help would be appreciated. Thanks.
Here is the output when I print the error:
An error occurred: <HttpError 400 when requesting https://www.googleapis.com/upload/drive/v3/files?alt=json&uploadType=resumable returned "Bad Request">
Here is the output when I print error.resp:
{'server': 'UploadServer',
'status': '400',
'x-guploader-uploadid': '...',
'content-type': 'application/json; charset=UTF-8',
'date': 'Mon, 26 Feb 2018 17:00:12 GMT',
'vary': 'Origin, X-Origin',
'alt-svc': 'hq=":443"; ma=2592000; quic=51303431; quic=51303339; quic=51303338; quic=51303337; quic=51303335,quic=":443"; ma=2592000; v="41,39,38,37,35"',
'content-length': '171'}
I'm unable to interpret this error. I have tried looking at the Google API Error Guide, but their explanation doesn't make sense to me, as all the parameters are the same as those in the requests with smaller files, which work.
Here is my code:
def insert_file_only(service, name, description, filename='', parent_id='root', mime_type=GoogleMimeTypes.PDF):
    """ Insert new file.
    Using documentation from Google Python API as a guide:
    https://developers.google.com/api-client-library/python/guide/media_upload
    Args:
        service: Drive API service instance.
        name: Name of the file to create, including the extension.
        description: Description of the file to insert.
        filename: Filename of the file to insert.
        parent_id: Parent folder's ID.
        mime_type: MIME type of the file to insert.
    Returns:
        Inserted file metadata if successful, None otherwise.
    """
    # Set the file metadata
    file_metadata = set_file_metadata(name, description, mime_type, parent_id)
    # Create media with correct chunk size
    if os.stat(filename).st_size <= 256*1024:
        media = MediaFileUpload(filename, mimetype=mime_type, resumable=True)
    else:
        media = MediaFileUpload(filename, mimetype=mime_type, chunksize=256*1024, resumable=True)
    file = None
    status = None
    start_from_beginning = True
    num_temp_errors = 0
    while file is None:
        try:
            if start_from_beginning:
                # Start from beginning
                logger.debug('Starting file upload')
                file = service.files().create(body=file_metadata, media_body=media).execute()
            else:
                # Upload next chunk
                logger.debug('Uploading next chunk')
                status, file = service.files().create(
                    body=file_metadata, media_body=media).next_chunk()
            if status:
                logger.info('Uploaded {}%'.format(int(100*status.progress())))
        except errors.HttpError as error:
            logger.error('An error occurred: %s' % error)
            logger.error(error.resp)
            if error.resp.status in [404]:
                # Start the upload all over again
                start_from_beginning = True
            elif error.resp.status in [500, 502, 503, 504]:
                # Increment counter on number of temporary errors
                num_temp_errors += 1
                if num_temp_errors >= NUM_TEMP_ERROR_LIMIT:
                    return None
                # Call next chunk again
            else:
                return None
    permissions = assign_permissions(file, service)
    return file
UPDATE
I tried using a simpler pattern, taking the advice from @StefanE. However, I still get an HTTP 400 error for files over 1 MB. The new code looks like this:
request = service.files().create(body=file_metadata, media_body=media)
response = None
while response is None:
    status, response = request.next_chunk()
    if status:
        logger.info('Uploaded {}%'.format(int(100*status.progress())))
UPDATE 2
I found that the issue is the conversion of the file into a Google Document, not the upload itself. I'm trying to upload an HTML file and convert it into a Google Doc. This works for files smaller than about 2 MB. When I only upload the HTML file and don't try to convert it, I don't get the above-mentioned error. It looks like this corresponds to the limit on this page. I don't know if this limit can be increased.
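If the conversion step is what hits that limit, a workaround would be to upload the raw HTML without asking Drive to convert it, i.e. leave the Google Docs MIME type out of the file metadata. A sketch with hypothetical names rather than my actual helper functions, assuming service is an authenticated Drive v3 client:
file_metadata = {'name': 'report.html', 'parents': [parent_id]}  # no Google Docs mimeType, so no conversion
media = MediaFileUpload('report.html', mimetype='text/html', chunksize=256*1024, resumable=True)
request = service.files().create(body=file_metadata, media_body=media, fields='id')
response = None
while response is None:
    status, response = request.next_chunk()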
I see some issues with your code.
First, you have a while loop that continues as long as file is None, and the first thing you do inside it is set the value of file, i.e. it will only loop once.
Secondly, you have the variable start_from_beginning, but it is never set to False anywhere in the code, so the else part of the statement will never be executed.
Looking at Google's documentation, their sample code looks a lot more straightforward:
media = MediaFileUpload('pig.png', mimetype='image/png', resumable=True)
request = farm.animals().insert(media_body=media, body={'name': 'Pig'})
response = None
while response is None:
    status, response = request.next_chunk()
    if status:
        print "Uploaded %d%%." % int(status.progress() * 100)
print "Upload Complete!"
Here you loop on while response is None, and response stays None until the upload is finished.
I'm trying to upload files with the following code:
url = "/folder/sub/interface?"
connection = httplib.HTTPConnection('www.mydomain.com')
def sendUpload(self):
    fields = []
    file1 = ['file1', '/home/me/Desktop/sometextfile.txt']
    f = open(file1[1], 'r')
    file1.append(f.read())
    files = [file1]
    content_type, body = self.encode_multipart_formdata(fields, files)
    myheaders['content-type'] = content_type
    myheaders['content-length'] = str(len(body))
    upload_data = urllib.urlencode({'command':'upload'})
    self.connection.request("POST", self.url + upload_data, {}, myheaders)
    response = self.connection.getresponse()
    if response.status == 200:
        data = response.read()
        self.connection.close()
        print data
The encode_multipart_formdata() comes from http://code.activestate.com/recipes/146306/
When I execute the method it takes a long time to complete; in fact, I don't think it ever finishes. On the network monitor I see that data is transferred, but the method doesn't return.
Why is that? Should I set a timeout somewhere?
You don't seem to be sending the body of your request to the server, so it's probably stuck waiting for content-length bytes to arrive, which they never do.
Are you sure that
self.connection.request("POST", self.url + upload_data, {}, myheaders)
shouldn't read
self.connection.request("POST", self.url + upload_data, body, myheaders)
?
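For completeness, with the body actually passed, the relevant part of sendUpload would read (same variables as in your question):
content_type, body = self.encode_multipart_formdata(fields, files)
myheaders['content-type'] = content_type
myheaders['content-length'] = str(len(body))
upload_data = urllib.urlencode({'command': 'upload'})
# pass `body` instead of an empty dict so the server actually receives the upload
self.connection.request("POST", self.url + upload_data, body, myheaders)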