Compress data before storage on Google App Engine - python

I'm trying to store 30-second user MP3 recordings as Blobs in my App Engine datastore. However, in order to enable this feature (App Engine has a 1MB limit per upload) and to keep costs down, I would like to compress the file before upload and decompress it every time it is requested. How would you suggest I accomplish this? (It can happen in the background via a task queue, but an efficient solution is always good.)
Based on my own tests and research, I see two possible approaches to accomplish this:
Zlib
For this I would need to compress a certain number of blocks at a time using a while loop. However, App Engine doesn't allow you to write to the file system. I thought about using a temporary file to accomplish this, but I haven't had luck with that approach when trying to decompress the content from a temporary file.
Gzip
From reading around the web, it appears that App Engine's URL Fetch function requests content gzipped already and then decompresses it. Is there a way to stop the function from decompressing the content so that I can just put it in the datastore in gzipped format, and then decompress it when I need to play it back to a user on demand?
Let me know how you would suggest using zlib, gzip, or some other solution to accomplish this. Thanks.
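(For reference, zlib can compress and decompress entirely in memory, so no temporary file or filesystem access is needed. A minimal sketch, assuming the recording is already available in memory as mp3_bytes:)

import zlib

compressed = zlib.compress(mp3_bytes)    # mp3_bytes: the raw recording (placeholder name)
original = zlib.decompress(compressed)   # round-trips without touching the filesystem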

"Compressing before upload" implies doing it in the user's browser -- but no text in your question addresses that! It seems to be about compression in your GAE app, where of course the data will only be after the upload. You could do it with a Firefox extension (or other browsers' equivalents), if you can develop those and convince your users to install them, but that has nothing much to do with GAE!-) Not to mention that, as #RageZ's comment mentions, MP3 is, essentially, already compressed, so there's little or nothing to gain (though maybe you could, again with a browser extension for the user, reduce the MP3's bit rate and thus the file's dimension, that could impact the audio quality, depending on your intended use for those audio files).
So, overall, I have to second @jldupont's suggestion (also in a comment) -- use a different server for storage of large files (S3, Amazon's offering, is surely a possibility, though not the only one).

While the technical limitations (mentioned in other answers) of compressing MP3 files via standard compression or reencoding at a lower bitrate are correct, your aim is to store 30 seconds of MP3 encoded data. Assuming that you can enforce that on your users, you should be alright without applying additional compression techniques if the MP3 bitrate is 256kbit constant bitrate (CBR) or lower. At 256kbit CBR, 30 seconds of audio would require:
(((256 * 1000) / 8) * 30) / 1048576 = 0.91MB
The maximum standard bitrate is 320kbit, which equates to 1.14MB, so you'd have to use 256kbit or less. The most commonly used bitrate in the wild is 128kbit.
There are additional overheads that will increase the final file size, such as ID3 tags and framing, but you should be OK. If not, drop down to 224kbit as your maximum (30 secs = 0.80MB). There are other complexities, such as variable bitrate encoding, for which the file size is not so predictable, and I am ignoring those here.
So your problem is no longer how to compress MP3 files, but how to ensure that your users are aware that they cannot upload more than 30 seconds encoded at 256kbit CBR, and how to enforce that policy.
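As a quick sanity check of those figures, a small Python snippet using the same formula as above (ignoring ID3 tags and framing overhead):

def mp3_size_mb(bitrate_kbit, seconds):
    # Worst-case size of CBR MP3 audio for the given bitrate and duration.
    return (bitrate_kbit * 1000 / 8.0) * seconds / 1048576

for rate in (128, 224, 256, 320):
    print("%d kbit, 30s -> %.2f MB" % (rate, mp3_size_mb(rate, 30)))
# 128 -> 0.46 MB, 224 -> 0.80 MB, 256 -> 0.92 MB, 320 -> 1.14 MB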

You could try the new Blobstore API, which allows the storage and serving of files up to 50MB:
http://www.cloudave.com/link/the-new-google-app-engine-blobstore-api-first-thoughts
http://code.google.com/appengine/docs/python/blobstore/
http://code.google.com/appengine/docs/java/blobstore/
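For the Python side, a minimal sketch of the Blobstore upload flow (the handler classes, routes, and use of webapp2 here are illustrative, not your app's actual setup):

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers
import webapp2

class UploadFormHandler(webapp2.RequestHandler):
    def get(self):
        # Blobstore issues a one-time URL; the browser POSTs the file there.
        upload_url = blobstore.create_upload_url('/upload')
        self.response.write(
            '<form action="%s" method="POST" enctype="multipart/form-data">'
            '<input type="file" name="file"><input type="submit"></form>' % upload_url)

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # The uploaded blob is stored outside the datastore's 1MB entity limit.
        blob_info = self.get_uploads('file')[0]
        self.redirect('/serve/%s' % blob_info.key())

class ServeHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, blob_key):
        self.send_blob(blobstore.BlobInfo.get(blob_key))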

As Aneto mentions in a comment, you will not be able to compress MP3 data with a standard compression library like gzip or zlib. However, you could reencode the MP3 at a MUCH lower bitrate, possibly with LAME.
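Roughly, the reencoding idea looks like this (file names and the 64kbit target are placeholders, and this has to run somewhere you control, not inside App Engine itself):

import subprocess

# Reencode at 64kbit CBR; the output is a fraction of the original size,
# at the cost of audio quality.
subprocess.check_call(["lame", "-b", "64", "input.mp3", "output_64kbit.mp3"])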

You can store up to 10MB with a list of Blobs. Search for the Google file service.
It's much more versatile than the Blobstore in my opinion, though I only started using the Blobstore API yesterday and I'm still figuring out whether it is possible to access the data bytewise (as in converting DOC to PDF, or JPEG to GIF).
You can store Blobs of 1MB * 10 = 10MB (the max entity size, I think), or you can use the Blobstore API and get the same 10MB, or 50MB if you enable billing (you can enable it, and if you don't exceed the free quota you don't pay).
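A rough sketch of the "list of blobs" idea with the old db API (the model and field names are invented for illustration):

from google.appengine.ext import db

CHUNK = 900 * 1024  # stay safely below the ~1MB entity limit

class AudioChunk(db.Model):
    recording_id = db.StringProperty()
    index = db.IntegerProperty()
    data = db.BlobProperty()

def store_recording(recording_id, payload):
    # Split the payload into <1MB pieces, each stored as its own entity.
    for i in range(0, len(payload), CHUNK):
        AudioChunk(recording_id=recording_id,
                   index=i // CHUNK,
                   data=db.Blob(payload[i:i + CHUNK])).put()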

Related

Best approach to handle large files on django website

Good morning all.
I have a generic question about the best approach to handle large files with Django.
I created a python project where the user is able to read a binary file (usually the size is between 30-100MB). Once the file is read, the program processes the file and shows relevant metrics to the user. Basically it outputs the max, min, average, std of the data.
At the moment, you can only run this project from the command line. I'm trying to create a user interface so that anyone can use it, so I decided to create a webpage using Django. The page is very simple: the user uploads files, selects which file to process, and the page shows the metrics.
Working on my local machine I was able to implement it: I upload the files (they are saved locally and then processed). I then created an S3 account, and now the files are all uploaded to S3. The problem I'm having is that when I try to get the file (I'm using smart_open, https://pypi.org/project/smart-open/), it is really slow to read (for a 30MB file it takes 300 seconds), but if I download the file first and then read it, it only takes 8 seconds.
My question is: what is the best approach to retrieve files from S3 and process them? I'm thinking of simply downloading the file to my server, processing it, and then deleting it. I've tried this on my localhost and it's fast: downloading from S3 takes 5 seconds and processing takes 4 seconds.
Would this be a good approach? I'm a bit afraid that if, for instance, I have 10 users at the same time and each one creates a report, I'll need 10 * 30MB = 300MB of space on the server. Is this practical, or will I fill up the server?
Thank you for your time!
Edit
To give a bit more context, what's making it slow is the f.read() call. Due to the format of the binary file, I have to read the file in the following way:
name = f.read(30)
unit = f.read(5)
data_length = int.from_bytes(f.read(2), 'little')  # 2-byte length field (byte order assumed here)
data = f.read(data_length)  # this is the part that takes a long time when reading directly from S3; from a downloaded file it is super fast
All,
After some experimenting, I found a solution that works for me.
import boto3, os

s3 = boto3.client('s3')
with open('temp_file_name', 'wb') as data:
    s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=data)
read_file('temp_file_name')  # read_file is my existing local parsing routine
os.remove('temp_file_name')
I don't know if this is the best approach or what the possible downsides of this approach are. I'll use it and come back to this post if I end up using a different solution.
The problem with my previous approach was that f.read() was taking too long; the problem seems to be that every time I need to do a new read, the program needs to connect to S3 (or something), and this is taking too long. What ended up working for me was to download the file directly to my server, read it, and then delete it once I've read it. Using this solution I was able to get the speeds I was getting when working locally (reading directly from my laptop).
If you are working with medium-sized files (30-50MB) this approach seems to work. My only concern is whether the server will run out of disk space if we try to download a really large file.
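If disk space ever becomes a concern, one variation (a sketch, assuming the files fit comfortably in memory; bucket and key names are placeholders) is to download into an in-memory buffer instead of a temp file:

import io
import boto3

s3 = boto3.client('s3')
buffer = io.BytesIO()
s3.download_fileobj(Bucket='YOURBUCKETNAME', Key='YOURKEY', Fileobj=buffer)
buffer.seek(0)

# Same sequential reads as before, but against local memory, so they're fast.
name = buffer.read(30)
unit = buffer.read(5)
data_length = int.from_bytes(buffer.read(2), 'little')  # byte order assumed
data = buffer.read(data_length)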

Export .wav from Audio Segment to AWS S3 Bucket

I'm using IBM's Text-to-Speech API to run speaker detection. I used pydub to concatenate several .wav files into one, but I cannot pass an AudioSegment to IBM.
My questions are:
Can I export my file directly to an AWS S3 bucket, so I can retrieve it from there later?
How else could I pass the AudioSegment? Could I encode it differently as a variable, exporting it without saving it to disk (keeping it in memory), if that makes sense?
These are the formats IBM can read:
application/octet-stream
audio/alaw (Required. Specify the sampling rate (rate) of the audio.)
audio/basic (Required. Use only with narrowband models.)
audio/flac
audio/g729 (Use only with narrowband models.)
audio/l16 (Required. Specify the sampling rate (rate) and optionally the number of channels (channels) and endianness (endianness) of the audio.)
audio/mp3
audio/mpeg
audio/mulaw
audio/ogg
audio/ogg;codecs=opus
audio/ogg;codecs=vorbis
audio/wav
audio/webm
audio/webm;codecs=opus
audio/webm;codecs=vorbis
I love pydub and it's been an amazing tool to work with so far. Thank you for making it!
Since you are using Python anyway, you could use smart_open to treat a remote file in your object storage just like a local one. This allows you to stream the parts of the file to the object storage without having all of them in memory at once. Any format should be fine for the object storage.
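For example, a sketch of exporting the combined AudioSegment straight into an S3 object with smart_open (the bucket, key, and input file names are placeholders):

from pydub import AudioSegment
from smart_open import open as s3_open

combined = AudioSegment.from_wav("part1.wav") + AudioSegment.from_wav("part2.wav")

# pydub's export() accepts any writable file-like object, so it can write
# directly into the streaming S3 upload without a local temp file.
with s3_open("s3://your-bucket/combined.wav", "wb") as remote_file:
    combined.export(remote_file, format="wav")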

How to pass large data between systems

I have pricing data, stored in XML format, that is generated on an hourly basis. It is roughly 100MB in size when stored as XML. I need to send this data to my main system in order to process it. In the future, it is also possible that this data will be sent as often as every minute.
What would be the best way to send this data? My thinking thus far was:
- It would be too large to send as JSON to a POST endpoint
- Possible to send it as XML and store it on my server
Is there a more optimal way to do this?
As mentioned in the answer by Michael Anderson, you could possibly send only a diff of the changes across each system.
One way to do this is to introduce a protocol such as git.
With git, you could:
Generate the data on the first system and push to a private repo
Have your second system pull the changes
This would be much more efficient than pulling the entire copy of the data every time.
It would also be compressed and sent over an encrypted channel (depending on the git server/service).
Assuming you're on Linux and the data is already written somewhere in your filesystem, why not just do a simple scp or rsync from a crontab entry?
You probably want to compress before sending, or enable compression in the protocol.
If your data only changes slightly, you could also try sending a patch against the previous version (generated with diff) instead of the entire data, and then regenerating on the other end.
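On the compression point, a minimal sketch (file names are placeholders); XML is very compressible, so the 100MB payload should shrink substantially:

import gzip
import shutil

# Write prices.xml.gz, then transfer it with scp/rsync, or POST it with
# Content-Encoding: gzip to your endpoint.
with open("prices.xml", "rb") as src, gzip.open("prices.xml.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)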

Is there an add-on to auto compress files while uploading into Plone?

Is there any add-on that will activate automatically while uploading files into the Plone site? It should compress the files and then upload them into the site. These can be image files like CAD drawings or any other type. Irrespective of the file type, beyond a specific size they should get compressed and stored, rather than manually compressing the files and storing them. I am using Plone 4.1. I am aware of the CSS and JavaScript files which get compressed, but not of uploaded files. I am also aware of the 'Image handling' section in Site Setup.
As Maulwurfn says, there is no such add-on, but this would be fairly straightforward for an experienced developer to implement using a custom content type. You will want to be pretty sure that the specific file types you're hoping to store will actually benefit from compression (many modern file formats already include some compression, so simply zipping them won't shrink them much).
Also, unless you implement something complex like a client-side Flash uploader with built-in compression, Plone can only compress files after they've been uploaded, not before, so if you're hoping to make uploads quicker for users, rather than to minimize storage space, you're facing a somewhat more difficult challenge.

How to upload huge files from Nokia 95 to webserver?

I'm trying to upload a huge file from my Nokia N95 mobile to my webserver using PyS60 Python code. However, the code crashes because I'm trying to load the file into memory and post it to an HTTP URL. Any idea how to upload huge files (> 120MB) to a webserver using PyS60?
Following is the code I use to send the HTTP request:
f = open(soundpath + audio_filename, 'rb')  # open in binary mode
fields = [('timestamp', str(audio_start_time)), ('test_id', str(test_id)), ('tester_name', tester_name), ('sensor_position', str(sensor_position)), ('sensor', 'audio')]
files = [('data', audio_filename, f.read())]  # f.read() pulls the whole file into memory -- this is what crashes
post_multipart(MOBILE_CONTEXT_HOST, MOBILE_CONTEXT_SERVER_PORT, '/MobileContext/AudioServlet', fields, files)
f.close()
Where does this post_multipart() function come from?
If it is from here, then it should be easy to adapt the code so that it takes a file object as an argument rather than the full content of the file, so that post_multipart reads small chunks of data while posting instead of loading the whole file into memory before posting.
This is definitely possible.
You can't. It's pretty much physically impossible. You'll need to split the file into small chunks and upload it bit by bit, which is very difficult to do quickly and efficiently on that sort of platform.
Jamie
You'll need to craft client code to split your source file into small chunks and rebuild the pieces server-side.
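A rough sketch of the client side of that idea (the chunk size and how each chunk is actually POSTed are assumptions; the server would reassemble the numbered pieces in order):

import os

CHUNK_SIZE = 512 * 1024  # 512KB per request keeps memory use small on the phone

def upload_in_chunks(path, post_chunk):
    """Read the file piece by piece and hand each numbered piece to post_chunk()."""
    index = 0
    sent = 0
    f = open(path, 'rb')
    try:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            post_chunk(index, chunk)  # e.g. an HTTP POST carrying the index plus the bytes
            sent += len(chunk)
            index += 1
    finally:
        f.close()
    return sent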
