Does GAE GCS write have an asynchronous version like the NDB functions? - python

Does GAE GCS write have an asynchronous version like the NDB functions (e.g., put_async)?
I found out (from Appstats) that because I am uploading multiple files sequentially in my code, it is consuming a lot of time (all of the writes run one after another). I am trying to reduce this time, since with six files it is already ~7 seconds and I want to support more files.
I have put the following code snippet in a for loop which iterates over all the files selected by the user on the webpage:
gcs_file = gcs.open(filename, 'w', content_type='image/jpeg')
gcs_file.write(photo)
gcs_file.close()
Please let me know if you need any more information.
UPDATE
Original code:
photo_blobkey_list = []
video_blobkey_list = []
i = 0
for photo in photo_list:
    filename = bucket + "/user_pic_" + str(user_index) + "_" + str(i)
    # Store the file in GCS
    gcs_file = gcs.open(filename, 'w', content_type='image/jpeg')
    gcs_file.write(photo)
    gcs_file.close()
    # Store the GCS filename in Blobstore
    blobstore_filename = '/gs' + filename
    photo_blobkey = blobstore.create_gs_key(blobstore_filename)
    photo_blobkey_list.append(photo_blobkey)
    i = i + 1
i = 0
for video in video_list:
    filename = bucket + "/user_video_" + str(user_index) + "_" + str(i)
    filename = filename.replace(" ", "_")
    # Store the file in GCS
    gcs_file = gcs.open(filename, 'w', content_type='video/avi')
    gcs_file.write(video)
    gcs_file.close()
    # Store the GCS filename in Blobstore
    blobstore_filename = '/gs' + filename
    video_blobkey = blobstore.create_gs_key(blobstore_filename)
    video_blobkey_list.append(video_blobkey)
    i = i + 1
::
user_record.put()
My Doubt:
Based on the suggestions, I plan to put the GCS write part in a tasklet subroutine that takes the filename and the photo/video as arguments. How do I use "yield" here so that the code above runs in parallel as much as possible (all of the photo and video write operations)? As per the docs, I need to give yield all of the futures at once for the tasklets to run in parallel, but in my case the number of photos and videos uploaded varies.
Relevant official docs text:
If that were two separate yield statements, they would happen in
series. But yielding a tuple of tasklets is a parallel yield: the
tasklets can run in parallel and the yield waits for all of them to
finish and returns the results.
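To illustrate the variable-count case: the futures can be collected in a list built in a loop and yielded together, so the number of items does not need to be fixed in advance. A rough sketch of that pattern (write_photo_async and do_async_gcs_write below are hypothetical; note that the stock cloudstorage write calls are blocking, so they only run in parallel if the helper itself yields a genuinely asynchronous operation, e.g. an async URLFetch to the GCS API):
from google.appengine.ext import ndb

@ndb.tasklet
def write_photo_async(filename, photo):
    # Hypothetical: wrap an asynchronous GCS write and return the blob key.
    blobkey = yield do_async_gcs_write(filename, photo)  # hypothetical helper
    raise ndb.Return(blobkey)

@ndb.tasklet
def write_all_photos_async(photo_list, bucket, user_index):
    futures = []
    for i, photo in enumerate(photo_list):
        filename = bucket + "/user_pic_" + str(user_index) + "_" + str(i)
        futures.append(write_photo_async(filename, photo))
    # Yielding the tuple of futures is a parallel yield: it waits for all of them.
    blobkeys = yield tuple(futures)
    raise ndb.Return(blobkeys)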

You can do the write operation in a task.
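For example, a minimal sketch using the deferred library (the helper name is illustrative; keep in mind that task payloads are limited in size, so very large files may not fit, and the blob keys would then have to be recorded by the task rather than in the request handler):
import cloudstorage as gcs
from google.appengine.ext import deferred

def write_to_gcs(filename, data, content_type):
    # Runs later in a task queue task, off the user-facing request.
    gcs_file = gcs.open(filename, 'w', content_type=content_type)
    gcs_file.write(data)
    gcs_file.close()

# In the request handler, enqueue one task per file:
deferred.defer(write_to_gcs, filename, photo, 'image/jpeg')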

There are async versions of the URLFetch API.
You could use these to write asynchronously, straight to Cloud Storage, via the Cloud Storage XML or JSON APIs.
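A rough sketch of that approach (assuming the app's service account can write to the bucket; the bucket name, object naming and error handling here are illustrative, not a complete implementation):
from google.appengine.api import app_identity, urlfetch

def start_gcs_upload(bucket_name, object_name, data, content_type):
    # Kick off an asynchronous upload via the GCS JSON API and return the RPC.
    token, _ = app_identity.get_access_token(
        'https://www.googleapis.com/auth/devstorage.read_write')
    url = ('https://www.googleapis.com/upload/storage/v1/b/%s/o'
           '?uploadType=media&name=%s' % (bucket_name, object_name))
    rpc = urlfetch.create_rpc(deadline=60)
    urlfetch.make_fetch_call(rpc, url, payload=data, method=urlfetch.POST,
                             headers={'Authorization': 'Bearer ' + token,
                                      'Content-Type': content_type})
    return rpc

# Start all uploads first, then wait for them together:
rpcs = [start_gcs_upload('my-bucket', 'user_pic_%s_%d' % (user_index, i),
                         photo, 'image/jpeg')
        for i, photo in enumerate(photo_list)]
for rpc in rpcs:
    result = rpc.get_result()  # check result.status_code here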

Related

Download a single file from a shared Dropbox folder without having the share link of that file

As the title says, I have access to a shared folder where some files are uploaded. I just want to download a specific file, called "db.dta". So, I have this script:
import sys
import requests

def download_file(url, filename):
    file_name = filename
    with open(file_name, "wb") as f:
        print("Downloading %s" % file_name)
        response = requests.get(url, stream=True)
        total_length = response.headers.get('content-length')
        if total_length is None:  # no content-length header
            f.write(response.content)
        else:
            dl = 0
            total_length = int(total_length)
            for data in response.iter_content(chunk_size=4096):
                dl += len(data)
                f.write(data)
                done = int(50 * dl / total_length)
                sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50 - done)))
                sys.stdout.flush()
    print(" ")
    print('Download successful.')
It actually downloads files whose share links I have, if I change dl=0 to dl=1, like this:
https://www.dropbox.com/s/ajklhfalsdfl/db_test.dta?dl=1
The thing is, I don't have the share link of this particular file in the shared folder, so if I use the URL of the file preview, I get an access denied error (even if I change dl=0 to dl=1).
https://www.dropbox.com/sh/a630ksuyrtw33yo/LKExc-MKDKIIWJMLKFJ?dl=1&preview=db.dta
Error given:
dropbox.exceptions.ApiError: ApiError('22eaf5ee05614d2d9726b948f59a9ec7', GetSharedLinkFileError('shared_link_access_denied', None))
Is there a way to download this file?
If you have the shared link to the parent folder and not the specific file you want, you can use the /2/sharing/get_shared_link_file endpoint to download just the specific file.
In the Dropbox API v2 Python SDK, that's the sharing_get_shared_link_file method (or sharing_get_shared_link_file_to_file). Based on the error output you shared, it looks like you are already using that (though not in the particular code snippet you posted).
Using that would look like this:
import dropbox
dbx = dropbox.Dropbox(ACCESS_TOKEN)
folder_shared_link = "https://www.dropbox.com/sh/a630ksuyrtw33yo/LKExc-MKDKIIWJMLKFJ"
file_relative_path = "/db.dat"
res = dbx.sharing_get_shared_link_file(url=folder_shared_link, path=file_relative_path)
print("Metadata: %s" % res[0])
print("File data: %s bytes" % len(res[1].content))
(You mentioned both "db.dat" and "db.dta" in your question. Make sure you use whichever is actually correct.)
Additionally, note that if you are using a Dropbox API app registered with the "app folder" access type, there is currently a bug that can cause this shared_link_access_denied error when using this method with an access token for an app folder app.
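If you would rather save the file straight to disk instead of holding its contents in memory, the _to_file variant should work the same way. A short sketch (the local path is illustrative, and use whichever of "db.dta"/"db.dat" is actually correct):
import dropbox

dbx = dropbox.Dropbox(ACCESS_TOKEN)

folder_shared_link = "https://www.dropbox.com/sh/a630ksuyrtw33yo/LKExc-MKDKIIWJMLKFJ"
file_relative_path = "/db.dta"
local_path = "/tmp/db.dta"  # wherever you want the file written

# Downloads the file inside the shared folder directly to local_path.
metadata = dbx.sharing_get_shared_link_file_to_file(
    local_path, folder_shared_link, path=file_relative_path)
print("Metadata: %s" % metadata)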

Django 1.11 download file chunk by chunk

In my case, I have a Django 1.11 server acting as a proxy. When the user clicks "download" in the browser, a request is sent to the Django proxy, which downloads the files from another server and processes them, after which it must "send" them to the browser so the user can download them. My proxy downloads and processes the files chunk by chunk.
How can I send the chunks to the browser as they become ready, so that the user ultimately downloads a single file?
In practice, I need to let the user download a file that is not ready yet, like a stream.
def my_download(self, res):
    # some code
    file_handle = open(local_path, 'wb', self.chunk_size)
    for chunk in res.iter_content(self.chunk_size):
        i = i + 1
        print("index: ", i, "/", chunks)
        if i > chunks - 1:
            is_last = True
        # some code on the chunk
        # Here, instead of saving the chunk locally, I would like to
        # let the user download it directly.
        file_handle.write(chunk)
    file_handle.close()
    return True
Thank you in advance, greetings.
This question should be flagged as a duplicate of this post: Serving large files ( with high loads ) in Django
Always try to find the answer before you create a question on SO, please!
Essentially, the answer is included in Django's documentation in the "Streaming large CSV files" example; we will apply it to the question above:
You can use Django's StreamingHttpResponse and Python's wsgiref.util.FileWrapper to serve a large file in chunks efficiently and without loading it into memory.
import os
from django.http import StreamingHttpResponse
from wsgiref.util import FileWrapper

def my_download(request):
    file_path = 'path/to/file'
    chunk_size = DEFINE_A_CHUNK_SIZE_AS_INTEGER
    filename = os.path.basename(file_path)

    response = StreamingHttpResponse(
        FileWrapper(open(file_path, 'rb'), chunk_size),
        content_type="application/octet-stream"
    )
    response['Content-Length'] = os.path.getsize(file_path)
    response['Content-Disposition'] = "attachment; filename=%s" % filename
    return response
Now, if you want to apply some processing to the file chunk by chunk, you can utilize the iterator generated by FileWrapper:
Place your chunk processing code in a function which MUST return the chunk:
def chunk_processing(chunk):
    # Process your chunk here
    # Be careful to preserve the chunk's initial size.
    return processed_chunk
Now apply the function inside the StreamingHttpResponse:
response = StreamingHttpResponse(
    (
        chunk_processing(chunk)
        for chunk in FileWrapper(open(file_path, 'rb'), chunk_size)
    ),
    content_type="application/octet-stream"
)
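For the original scenario, where the file comes from another server rather than from local disk, a similar approach should work by wrapping the upstream response's chunk iterator in a generator. A minimal sketch, assuming the upstream download is done with requests using stream=True and that chunk_processing is the per-chunk function from above (the URL is illustrative):
import requests
from django.http import StreamingHttpResponse

def my_proxy_download(request):
    upstream_url = "https://other-server.example.com/path/to/file"  # illustrative
    res = requests.get(upstream_url, stream=True)

    def chunk_generator():
        # Each chunk is processed and yielded as soon as it arrives, so the
        # browser starts downloading before the whole file has been fetched.
        for chunk in res.iter_content(chunk_size=8192):
            yield chunk_processing(chunk)

    response = StreamingHttpResponse(chunk_generator(),
                                     content_type="application/octet-stream")
    response['Content-Disposition'] = 'attachment; filename="file"'
    return response
Content-Length is omitted here because the size of the processed output may not be known in advance.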

How to get data from s3 and do some work on it? python and boto

I have a project task that uses some output data I have already produced on S3 in an EMR task. Previously, I ran an EMR job that produced output in one of my S3 buckets in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3 and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
Download the file onto your local system and parse it (kinda simple, quick and easy).
Get the data stored on S3 into memory and parse it (a bit more complex in the case of huge files).
Step 1:
On S3, filenames are stored as keys: if you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the below code to download the file into a temp folder.
from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False,
                         host='s3-us-west-2.amazonaws.com')
source_bucket = conn.lookup(BUCKET_NAME)

# Download the file
for name in source_bucket.list():
    if fileName in name.name:
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)
You can then work on the file in that temp path.
Step 2:
You can also fetch the data as a string using data = name.get_contents_as_string(). In the case of huge files (> 1 GB) you may run into memory errors; to avoid them, you will have to write a lazy function that reads the data in chunks.
For example, you can use a Range header to fetch part of the file: data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)}).
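A minimal sketch of such a lazy reader, assuming a boto Key object obtained from bucket.list() or get_key() (the 100 MB chunk size is just an example):
def read_key_in_chunks(key, chunk_size=100 * 1024 * 1024):
    # Yield the object's contents in byte ranges instead of all at once.
    start = 0
    total = key.size  # boto sets the size on keys returned by list()
    while start < total:
        end = min(start + chunk_size, total) - 1
        yield key.get_contents_as_string(
            headers={'Range': 'bytes=%d-%d' % (start, end)})
        start = end + 1

# Usage:
# for chunk in read_key_in_chunks(name):
#     process(chunk)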
I am not sure if I have answered your question properly. I can write custom code for your requirement once I get some time; meanwhile, please feel free to post any queries you have.

Python script hangs when executing long running query, even after query completes

I've got a Python script that loops through folders and, within each folder, executes each SQL file against our Redshift cluster (using psycopg2). Here is the code that does the loop (note: this works just fine for queries that take only a few minutes to execute):
for folder in dir_list:
    # Each query is stored in a folder by group, so we have to go through
    # each folder and then each file in that folder
    file_list = os.listdir(source_dir_wkly + "\\" + str(folder))
    for f in file_list:
        src_filename = source_dir_wkly + "\\" + str(folder) + "\\" + str(f)
        dest_filename = dest_dir_wkly + "\\" + os.path.splitext(os.path.basename(src_filename))[0] + ".csv"
        result = dal.execute_query(src_filename)
        result.to_csv(path_or_buf=dest_filename, index=False)
execute_query is a method stored in another file:
def execute_query(self, source_path):
    conn_rs = psycopg2.connect(self.conn_string)
    cursor = conn_rs.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    sql_file = self.read_sql_file(source_path)
    cursor.execute(sql_file)
    records = cursor.fetchall()
    conn_rs.commit()
    return pd.DataFrame(data=records)

def read_sql_file(self, path):
    sql_path = path
    f = open(sql_path, 'r')
    return f.read()
I have a couple queries that take around 15 minutes to execute (not unusual given the size of the data in our Redshift cluster), and they execute just fine in SQL Workbench. I can see in the AWS Console that the query has completed, but the script just hangs and doesn't dump the results to a csv file, nor does it proceed to the next file in the folder.
I don't have any timeouts specified. Is there anything else I'm missing?
The line records = cursor.fetchall() is likely the culprit. It reads all of the data, and hence loads every row the query returned into memory. Given that your queries return very large result sets, that data probably cannot all be loaded into memory at once.
You should iterate over the results from the cursor and write them into your CSV one at a time. In general, trying to read all of the data from a database query at once is not a good idea.
You will need to refactor your code to do so:
for record in cursor:
    csv_fh.write(record)
where csv_fh is a file handle for your CSV file. Your use of pd.DataFrame will also need rewriting, since it currently expects all of the data to be passed to it at once.
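A minimal sketch of that refactor, assuming the RealDictCursor from the question (so each row comes back as a dict) and Python 3's csv module; names like dest_filename are illustrative:
import csv
import psycopg2
import psycopg2.extras

def execute_query_to_csv(conn_string, sql_text, dest_filename):
    # Stream rows straight from the cursor into the CSV file
    # instead of materialising them all with fetchall().
    conn_rs = psycopg2.connect(conn_string)
    cursor = conn_rs.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    cursor.execute(sql_text)

    with open(dest_filename, 'w', newline='') as csv_fh:
        writer = None
        for record in cursor:  # each record is a dict, thanks to RealDictCursor
            if writer is None:
                writer = csv.DictWriter(csv_fh, fieldnames=list(record.keys()))
                writer.writeheader()
            writer.writerow(record)

    cursor.close()
    conn_rs.close()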

Downloading and zipping files from amazon

I'm currently storing all my photos on Amazon S3 and using Django for my website. I want to have a button that allows users to click it and have all their photos zipped and returned to them.
I'm currently using boto to interface with Amazon and found that I can go through the entire bucket list / use get_key to look for specific files and download them.
After this I would need to temporarily store them, then zip and return them.
What is the best way to go about doing this?
Thanks
You can take a look at this question or at this snippet to download the file:
# This is not a full working example, just a starting point
# for downloading images in different formats.
import subprocess
import Image

def image_as_png_pdf(request):
    output_format = request.GET.get('format')
    im = Image.open(path_to_image)  # any Image object should work
    if output_format == 'png':
        response = HttpResponse(mimetype='image/png')
        response['Content-Disposition'] = 'attachment; filename=%s.png' % filename
        im.save(response, 'png')  # will call response.write()
    else:
        # Temporary disk space, server process needs write access
        tmp_path = '/tmp/'
        # Full path to ImageMagick convert binary
        convert_bin = '/usr/bin/convert'
        im.save(tmp_path + filename + '.png', 'png')
        response = HttpResponse(mimetype='application/pdf')
        response['Content-Disposition'] = 'attachment; filename=%s.pdf' % filename
        ret = subprocess.Popen([convert_bin,
                                "%s%s.png" % (tmp_path, filename), "pdf:-"],
                               stdout=subprocess.PIPE)
        response.write(ret.stdout.read())
    return response
To create a zip, follow the link that I gave you. You can also use zipimport as shown here; examples are at the bottom of the page, and follow the documentation for newer versions.
You might also be interested in this, although it was made for Django 1.2 and might not work on 1.3.
Using python-zipstream as patched with this pull request you can do something like this:
import boto
import io
import zipstream
import sys

def iterable_to_stream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module). For efficiency, the stream is buffered.

    From: https://stackoverflow.com/a/20260030/729491
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None

        def readable(self):
            return True

        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0  # indicate EOF

    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def iterate_key():
    bucket = boto.connect_s3().get_bucket('lastage')
    key = bucket.get_key('README.markdown')
    for chunk in key:
        yield chunk

with open('/tmp/foo.zip', 'w') as f:
    z = zipstream.ZipFile(mode='w')
    z.write(iterable_to_stream(iterate_key()), arcname='foo1')
    z.write(iterable_to_stream(iterate_key()), arcname='foo2')
    z.write(iterable_to_stream(iterate_key()), arcname='foo3')
    for chunk in z:
        print "CHUNK", len(chunk)
        f.write(chunk)
Basically, we iterate over the key contents using boto, convert this iterator to a stream using the iterable_to_stream function from this answer, and then have python-zipstream create a zip file on the fly.
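Since the question is about a Django site, the same zipstream object can also be handed to a StreamingHttpResponse instead of being written to /tmp, so the archive is streamed to the user as it is built. A short sketch reusing iterate_key and iterable_to_stream from above (the view name and filenames are illustrative):
from django.http import StreamingHttpResponse
import zipstream

def download_photos_zip(request):
    z = zipstream.ZipFile(mode='w')
    # Add one entry per S3 key; here we reuse the single-key helper above.
    z.write(iterable_to_stream(iterate_key()), arcname='photo1.jpg')

    # zipstream.ZipFile is iterable, so Django can stream it chunk by chunk.
    response = StreamingHttpResponse(z, content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename="photos.zip"'
    return response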
