How do I work with large data with a web application? - python

I am wrapping up a personal project that involved Flask, Python, and PythonAnywhere. I learned a lot, and now I have some new ideas for personal projects.
My next project involves processing video files and converting them into other file types, for example JPGs. When I drafted out how my system could work, I quickly realized that the platform I am currently using for web application hosting, PythonAnywhere, will be too expensive and perhaps even too slow, since I will be working with large files.
I searched around and found AWS S3 for file storage, but I am having trouble figuring out how I can operate on that data to do my conversions in Python. I definitely don't want to download from S3, operate on the data in PythonAnywhere, and then re-upload the converted files to a bucket. The project will be available for use on the internet, so I am trying to make it as robust and scalable as possible.
I found it hard to even word this question, as I am not sure I am asking the right questions. I guess I am looking for a way to manipulate large data files, preferably in Python, without having to work with the data locally, if that makes any sense.
I am open to learning new technologies and am looking for some direction on how I might achieve this personal project.

Have you looked into AWS Elastic Transcoder?
Amazon Elastic Transcoder lets you convert media files that you have stored in Amazon Simple Storage Service (Amazon S3) into media files in the formats required by consumer playback devices. For example, you can convert large, high-quality digital media files into formats that users can play back on mobile devices, tablets, web browsers, and connected televisions.
Like all things AWS, there are SDKs (e.g. the Python SDK, boto3) that allow you to access the service programmatically.
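As a minimal sketch of what that looks like with boto3: you create a job on an existing Elastic Transcoder pipeline that reads a video from one S3 key and writes the converted output to another. The region, pipeline ID, object keys, and preset ID below are placeholders you would replace with your own (pipelines and presets are set up in the Elastic Transcoder console or API).

```python
import boto3

# Placeholder region -- use the region where your pipeline lives.
transcoder = boto3.client("elastictranscoder", region_name="us-east-1")

response = transcoder.create_job(
    PipelineId="1111111111111-abcde1",          # placeholder pipeline linking your input/output buckets
    Input={"Key": "uploads/source-video.mp4"},  # object key in the pipeline's input bucket
    Outputs=[
        {
            "Key": "converted/output.mp4",       # object key written to the output bucket
            "PresetId": "1351620000001-000010",  # a system preset ID; pick the output format you need
        }
    ],
)

# The job runs asynchronously on AWS; poll or use notifications for completion.
print(response["Job"]["Id"], response["Job"]["Status"])
```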

Related

Data upload server with user management and resumable uploads

I’m looking to build a web-based data upload server for a citizen science project and am wondering if there are out-of-the-box solutions available, or if there are some useful Python packages or libraries that would make the job easier?
I don’t really want to reinvent the wheel and it seems that something like this should already exist. Maybe I’m just looking in the wrong place. 
The brief is that our volunteers make audio recordings to monitor threatened species, then upload their data for archiving and automated processing. I’d like a server that has the following: 
Simple web-based user interface - many of our participants have limited confidence with computers;
No client-side software to install; 
User management: registration to approved email addresses only (or similar, maybe a manual admin approval process); 
Data files are 1 to 40 MB in size, but there are lots of them (~1000 files, ~10 GB in total). If a user loses their network connection, uploads should be recoverable, with the server capable of resuming an upload where it left off. That's quite important.
Live progress and status updates to the user.
I have access to a web hosting server. Maybe a Django or Flask implementation already exists, or there's something similar I could adapt. I've looked at things like Dropbox shared directories but they don't quite fit.

Django Google App Engine Upload files greater than 32mb

I have a Django Rest Framework project that I've integrated with django-storages to upload files to GCS. Everything works locally. However, Google App Engine imposes a hard limit of 32 MB on the size of each request, so I cannot upload any files larger than that limit.
I looked into many posts here on StackOverflow and elsewhere on the internet. Some of the solutions listed the use of the Blobstore API; however, I cannot find a way to integrate this into Django. Another solution describes the use of django-filetransfers, but that plugin is obsolete.
I would appreciate it if someone can point me towards an approach I can take to fixing this problem.
PS: I would like to point out that the current setup works like this: a POST request sends the file up to the server, which then handles the process of storing the file in Google Cloud Storage. Since Google App Engine restricts request size to 32 MB, I never get to the point of receiving the file. So my issue is: how can I go about uploading these large files?
According to the official documentation [1], Cloud Storage can manage files up to 5 TB in size. Nevertheless, it is recommended to take a look at the best practices document [2], and there is also an example of how to upload objects using Python here [3].
[1]https://cloud.google.com/storage/docs/json_api/v1/objects/insert
[2]https://cloud.google.com/storage/docs/best-practices#uploading
[3]https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
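For reference, the Python example linked in [3] boils down to something like the following sketch (it assumes the google-cloud-storage package is installed; the bucket and object names are placeholders):

```python
from google.cloud import storage

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Upload a local file to the given Cloud Storage bucket."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)

# Placeholder names -- substitute your own bucket and object paths.
upload_blob("my-bucket", "local/video.mp4", "uploads/video.mp4")
```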

Images disappearing (not storing correctly?) after x amount of time [duplicate]

The app I am currently hosting on Heroku allows users to submit photos. Initially, I was thinking about storing those photos on the filesystem, as storing them in the database is apparently bad practice.
However, it seems there is no permanent filesystem on Heroku, only an ephemeral one. Is this true and, if so, what are my options with regards to storing photos and other files?
It is true. Heroku allows you to create cloud apps, but those cloud apps are not "permanent" - they run as ephemeral instances built from a compiled "slug" that can be replicated multiple times on Amazon's EC2 (that's why scaling is so easy with Heroku). If you were to push a new version of your app, the slug would be recompiled, and any files you had saved to the filesystem in the previous instance would be lost.
Your best bet (whether on Heroku or otherwise) is to save user-submitted photos to a CDN. Since you are on Heroku, and Heroku uses AWS, I'd recommend Amazon S3, optionally with CloudFront enabled.
This is beneficial not only because it gets around Heroku's ephemeral "limitation", but also because a CDN is much faster, and will provide a better service for your webapp and experience for your users.
Depending on the technology you're using, your best bet is likely to stream the uploads to S3 (Amazon's storage service). You can interact with S3 with a client library to make it simple to post and retrieve the files. Boto is an example client library for Python - they exist for all popular languages.
Another thing to keep in mind is that Heroku filesystems are not shared either. This means you'll have to put the file to S3 from the same application that handles the upload (instead of, say, a worker process). If you can, try to load the upload into memory, never write it to disk, and post directly to S3. This will increase the speed of your uploads.
Because Heroku is hosted on AWS, streams to S3 happen at very high speed. Keep in mind that they will be slower when you're developing locally.
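As a rough sketch of the "never write it to disk" approach in a Flask handler (boto3 assumed; the bucket name and form field name are placeholders): request.files gives you a file-like object that boto3 can stream straight to S3.

```python
import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")

@app.route("/upload", methods=["POST"])
def upload():
    # "photo" is a hypothetical form field name; the bucket name is a placeholder.
    uploaded = request.files["photo"]
    # upload_fileobj streams the file-like object in chunks -- nothing touches local disk.
    s3.upload_fileobj(uploaded, "my-photo-bucket", uploaded.filename)
    return "uploaded", 201
```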

Automatically upload photos to a particular Google Photos album

I'm trying to automatically upload JPG photo files from a particular directory on my computer to a particular album on Google Photos. I'd like the photos to periodically get pushed up to Google Photos (every day or so is frequent enough). Google Photos Backup almost does what I want, but it just uploads the files -- it doesn't put them into a particular [pre-existing] album on Google Photos. It's possible that I can somehow use Google Drive and a simple cron job for this, although I don't know how. I am also considering using the Picasa Web Albums API, but that feels like overkill and I'd like to avoid that work unless it's necessary. Are there any straightforward solutions to this?
Since you said that Google Photos Backup does the upload job, in my opinion the best way is to use a Google Apps Script stored in your Google Drive (running periodically) to push each newly detected picture into a particular album.
If you need the relevant documentation, take a look at the album class documentation and also https://developers.google.com/apps-script/
If you need to use another language to do the job (Python, JS, etc.), please specify which one and also tell us more about your platform (Mac / Windows / Linux).
Use IFTTT for this. The Google Photos channel fits this purpose perfectly. https://ifttt.com/applets/DMgPS2uZ-back-up-new-android-photos-you-take-to-google-photos

Storing text files > 1MB in GAE/P

I have a Google App Engine app where I need to store text files that are larger than 1 MB (the maximum entity size).
I'm currently storing them in the Blobstore and I make use of the Files API for reading and writing them. Current operations include uploading them from a user, reading them to process and update them, and presenting them to a user. Eventually, I would like to allow a user to edit them (likely as a Google Doc).
Are there advantages to storing such text files in Google Cloud Storage, as a Google Doc, or in some other location instead of using the Blobstore?
It really depends on what exactly you need. There are of course advantages to using one service over the other, but in the end it doesn't matter much, since all of the solutions will be almost equally fast and not that expensive. If you end up with a huge amount of data after some time, you might consider switching to another solution, just because you might save some money.
Having said that, I suggest you continue with the Blobstore API, since that will not require extra communication with external services, more secret keys, etc. Security- and speed-wise it is exactly the same. When you reach 10K or 100K users, you will know whether it's actually worth storing the files somewhere else. Continue with what you know best, but make sure you're following the right practices when building on Google App Engine.
If you're already using the Files API to read and write the files, I'd recommend you use Google Cloud Storage rather than the Blobstore. GCS offers a richer RESTful API (which makes it easier to do things like access control), does a number of things to accelerate serving static data, etc.
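As a minimal sketch of what reading and writing looks like with the App Engine GCS client library (the cloudstorage package; the bucket path below is a placeholder):

```python
import cloudstorage as gcs

# Placeholder bucket/object path -- replace with your own bucket.
filename = "/my-bucket/texts/report.txt"

# Write a text file, which can be larger than the 1 MB datastore entity limit.
with gcs.open(filename, "w", content_type="text/plain") as f:
    f.write("lots of text ...")

# Read it back later for processing.
with gcs.open(filename) as f:
    data = f.read()
```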
Sharing data is easier in Google Docs (now Google Drive) and Google Cloud Storage. Using Google Drive, you can also use the power of Google Apps Script.
