Set Time Constraint for generated URL of uploaded files - python

I am trying to build an application (using Django) which uploads files and generates a corresponding URL for each one. Is there some way to set a time constraint on the URL, i.e. the uploaded file should be available at its URL only for a limited time, and after the specified time the URL should give an error?
I would be using the default Django development server; in that case, what would be the possible ways to tackle the time constraint problem? I would be glad if you could answer for both cases, a global limit and per-file limits, though even a single solution is good :)
~Newbie up with a Herculean Task! Thank You :)

You can add a DateTimeField as an additional column and expire the file as and when required.
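A minimal sketch of that idea, assuming a model is acceptable (the UploadedFile name and expires_at field are mine, not from the question):

from datetime import datetime
from django.db import models

class UploadedFile(models.Model):
    file = models.FileField(upload_to='uploads/')
    expires_at = models.DateTimeField()  # the URL should stop working after this moment

    def is_expired(self):
        return datetime.now() >= self.expires_at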

If your uploaded files are being served by the Django app itself, then it's quite easy (and can be solved in different ways depending on whether the "time constraint" is global to all files/URLs or not).
Otherwise - that is, if the files are served by Apache or anything similar - you'll have to resort to some asynchronous mechanism to collect and delete "obsolete" files, either the quick-and-dirty way (a cron job) or with some help from Celery.
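If the Django app does serve the files itself, the per-file check can live in the view. A rough sketch reusing a model like the one above (the view name, URL scheme and error behaviour are assumptions):

from datetime import datetime
from django.http import Http404, HttpResponse
from django.shortcuts import get_object_or_404
from myapp.models import UploadedFile  # the hypothetical model sketched above

def serve_upload(request, file_id):
    upload = get_object_or_404(UploadedFile, pk=file_id)
    if datetime.now() >= upload.expires_at:
        raise Http404("This link has expired")  # the URL now gives an error
    return HttpResponse(upload.file.read(), content_type='application/octet-stream')

For files served by Apache instead, the same expires_at column can drive the cron job or Celery task that deletes stale rows and files.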

Related

What do I need to consider when scaling an application that stores files in the filesystem?

I am interested in making an app where users can upload large files (~2MB) that are converted into HTML documents. This application will not have a database. Instead, these HTML files are stored in a particular writable directory outside of the document source tree. Thus this directory will grow larger and larger as more files are added to it. Users should be able to view these HTML files by visiting the appropriate URL. All security concerns aside, what do I need to be worried about if this directory continues to grow? Will accessing the files inside take longer when there are more of them? Will it potentially crash because of this? Should I create a new directory every 100 files or so to prevent this?
If it is important, I want to make this app using Pyramid and Python.
You might want to partition the directories by user, app or similar so that it's easy to manage anyway - like if a user stops using the service you could just delete their directory. Also I presume you'll be zipping them up. If you keep it well decoupled then you'll be able to change your mind later.
I'd be interested to see how using something like SQLite would work for you, as you could have a sqlite db per partitioned directory.
I presume the HTML files are larger than the files that were uploaded, so why store the big HTML file?
Are things like MongoDB etc. out of the question? As your app scales across multiple servers, you have the issue of accessing files that live on a different server, unless you pick the right server in the first place using some technique. Then it's possible you've got servers sitting idle because no one wants their documents.
Why the limitation of just storing files in a directory - is it a proof of concept?
EDIT
I find value in reading things like http://blog.fogcreek.com/the-trello-tech-stack/ and I'd advise you to find a site already doing what you do and read about their tech stack.
As someone already commented, why not use Amazon S3 or similar?
Ask yourself realistically how many users you imagine, and whether you really want to spend a lot of energy worrying about being the next Facebook and building the ultimate backend tech stack when you could get your stuff out there being used.
Years ago I worked on a system that stored insurance certificates on the filesystem; we used to run out of inodes!
Dare I say it's a case of suck it and see what works for you and your app.
EDIT
HAProxy, I believe, is meant to handle all those load-balancing concerns.
As a user, I imagine I'd want to go to http://docs.yourdomain.com/myname/document.doc,
although I presume there are security concerns with so obvious a name.
This greatly depends on your filesystem. You might want to look up which problems the git folks encountered (they also use a purely filesystem-based database).
In general, it will be wise to split that directory up, for example by taking the first two or three letters of the file name (or a hash of it) and grouping the files into subdirectories based on that key. You'd have a structure like:
uploaddir/
    00/
        files whose name sha1 starts with 00
    01/
        files whose name sha1 starts with 01
and so on. This takes some load off the filesystem by partitioning the possibly large directories. If you want to be sure that no user can perform a denial-of-service attack by deliberately uploading files whose names hash to the same initial characters, you can also seed the hash differently or salt it, or anything like that.
Specifically, the effects of large directories are pretty file-system specific. Some might become slow, some may cope really well, others may have per-directory limits for files.
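As an illustration of the sha1-prefix scheme above, a small helper that maps a file name to its subdirectory (the two-character fan-out and the uploaddir root simply mirror the example; adjust to taste):

import hashlib
import os

def partitioned_path(root, filename):
    # Place the file under root/<first two hex chars of sha1(filename)>/.
    digest = hashlib.sha1(filename.encode('utf-8')).hexdigest()
    subdir = os.path.join(root, digest[:2])
    if not os.path.isdir(subdir):
        os.makedirs(subdir)
    return os.path.join(subdir, filename)

# e.g. partitioned_path('uploaddir', 'report.html') returns
# 'uploaddir/xx/report.html', where xx is the first two characters of the hash.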

Django session race condition?

Summary: is there a race condition in Django sessions, and how do I prevent it?
I have an interesting problem with Django sessions which I think involves a race condition due to simultaneous requests by the same user.
It has occurred in a script for uploading several files at the same time, being tested on localhost. I think this makes simultaneous requests from the same user quite likely (low response times due to localhost, long requests due to file uploads). It's still possible for normal requests outside localhost, though, just less likely.
I am sending several (file post) requests that I think do this:
1. Django automatically retrieves the user's session
2. Unrelated code that takes some time
3. Get request.session['files'] (a dictionary)
4. Append data about the current file to the dictionary
5. Store the dictionary in request.session['files'] again
6. Check that it has indeed been stored
7. More unrelated code that takes time
8. Django automatically stores the user's session
Here the check at step 6 will indicate that the information has indeed been stored in the session. However, future requests indicate that sometimes it has, sometimes it has not.
What I think is happening is that two of these requests (A and B) happen simultaneously. Request A retrieves request.session['files'] first, then B does the same, changes it and stores it. When A finally finishes, it overwrites the session changes by B.
Two questions:
Is this indeed what is happening? Is the Django development server multithreaded? On Google I'm finding pages about making it multithreaded, suggesting that by default it is not. Otherwise, what could be the problem?
If this race condition is the problem, what would be the best way to solve it? It's an inconvenience but not a security concern, so I'd already be happy if the chance can be decreased significantly.
Retrieving the session data right before the changes and saving it right after should decrease the chance significantly, I think. However, I have not found a way to do this for request.session, only a workaround using django.contrib.sessions.backends.db.SessionStore directly. And I figure that if I change it that way, Django will just overwrite it with request.session at the end of the request.
So I need a request.session.reload() and request.session.commit(), basically.
Yes, it is possible for a request to start before another has finished. You can check this by printing something at the start and end of a view and launching a bunch of requests at the same time.
Indeed the session is loaded before the view and saved after the view. You can reload the session using request.session = engine.SessionStore(session_key) and save it using request.session.save().
Reloading the session, however, discards any data added to the session before that point (in the view or before it), and saving before reloading would defeat the point of loading late. A better way would be to save the files to the database as a new model.
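A rough sketch of that reload-late, save-immediately pattern inside an upload view (the 'files' key comes from the question; the import path assumes the default database session backend, and the payload is hypothetical):

from django.contrib.sessions.backends.db import SessionStore
from django.http import HttpResponse

def upload_view(request):
    # ... long-running work that does not touch the session ...

    # Reload the session as late as possible to shrink the race window.
    fresh = SessionStore(session_key=request.session.session_key)
    files = fresh.get('files', {})
    files['current-file'] = {'status': 'done'}  # hypothetical payload
    fresh['files'] = files
    fresh.save()  # write back immediately rather than at the end of the request

    # Keep request.session in step so Django's own save does not clobber it.
    request.session['files'] = files
    return HttpResponse('ok')

This only narrows the window rather than closing it; truly concurrent writes still call for a store with atomic updates, such as one model row per uploaded file.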
The essence of the answer is in the discussion of Thomas' answer, which was incomplete so I've posted the complete answer.
Mark just nailed it; the only minor addition from me is how to reload that session:
for key in session.keys():  # if you have potential removals
    del session[key]
session.update(session.load())
session.modified = False  # just making it clean
The first line is optional; you only need it if certain values might have been removed from the session in the meantime.
The last line is also optional; if you go on to update the session, it does not really matter.
That is true. You can confirm it by having a look at the django.contrib.sessions.middleware.SessionMiddleware.
Basically, request.session is loaded before request hits your view (in process_request), and it is updated in the session backend (if needed) after the response has left your view (in process_response).
If what I mean is unclear, you might want to have a look at the django documentation for Middleware.
The best way to solve the issue will depend on what you're trying to achieve with that information. I'll update my answer if you provide that information!

Force GAE to update all files on deployment

Is there a way to force GAE to upload and update all files, even if it thinks they don't require any updates?
Clarification - If I make quick back-to-back updates, I find that certain files that were definitely modified refuse to be updated online. Apart from assigning version numbers to force the update, which is very painful, is there another way?
EDIT - I'm referring to javascript files
Those files do get updated; you don't see the update because of caching that happens somewhere along the chain. In order to get the latest files, load the file with a slug (e.g. http://myapp.com/scripts/script.js?slug) and update the slug each time you deploy your application.
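If updating the slug by hand is a chore, one way to automate it, as a sketch: the App Engine Python runtime sets a CURRENT_VERSION_ID environment variable that changes with each deployment, so it can double as the cache-busting slug (the helper name and example path are made up):

import os

def script_url(path):
    # Append a deploy-specific slug so browsers refetch after each deployment.
    version = os.environ.get('CURRENT_VERSION_ID', 'dev')
    return '%s?v=%s' % (path, version)

# e.g. pass script_url('/scripts/script.js') into your template context
# and reference that instead of the bare path.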

App engine app design questions

I want to load info from another site (this part is done), but I am doing this every time the page is loaded, and that won't do. So I was thinking of having a variable in a table of settings, like 'last checked bbc site', and when the page loads it would check whether it's been long enough since the last check to check again. Is there anything silly about doing it that way?
Also, do I absolutely have to use tables to store one-off variables like this setting?
I think there are 2 options that would work for you, besides creating an entity in the datastore to keep track of the "last visited time".
One way is to just check the external page periodically, using the cron api as described by jldupont.
The second way is to store the last visited time in memcache. Although memcache is not permanent, it doesn't have to be if you are only storing last refresh times. If your entry in memcache were to disappear for some reason, the worst that would happen would be that you would fetch the page again, and update memcache with the current date/time.
The first way would be best if you want to check the external page at regular intervals. The second way might be better if you want to check the external page only when a user clicks on your page, and you haven't fetched that page yourself in the recent past. With this method, you aren't wasting resources fetching the external page unless someone is actually looking for data related to it.
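A sketch of the memcache approach (the key name, the three-minute interval, the URL and the function name are illustrative only):

import time
from google.appengine.api import memcache, urlfetch

CHECK_INTERVAL = 180  # seconds between external checks; pick what suits you
LAST_CHECK_KEY = 'last-checked-bbc-site'

def maybe_refresh():
    last = memcache.get(LAST_CHECK_KEY)
    if last is not None and time.time() - last < CHECK_INTERVAL:
        return  # checked recently enough, skip the external fetch
    result = urlfetch.fetch('http://news.bbc.co.uk/')
    # ... parse result.content and store whatever you need ...
    memcache.set(LAST_CHECK_KEY, time.time())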
You could also use Scheduled Tasks.
Also, you don't absolutely need to use the Datastore for configuration parameters: you could have this in a script / config file.
If you want some handler in your GAE app (including one for a scheduled task, the reception of messages, web page visits, etc.) to store some new information in such a way that some handler in the future can recover that information, then GAE's datastore is the only good general way (memcache could expire from under you, for example). Not sure what you mean by "tables" (?!), but guessing that you actually mean GAE's datastore, the answer is "yes". (Under very specific circumstances you might want to put that data somewhere else on the network, such as your visitor's browser, e.g. via cookies, or an Amazon storage instance, etc., but it does not appear to me that those specific circumstances apply to your use case.)

Caching data from other websites in Django

Suppose I have a simple view which needs to parse data from an external website.
Right now it looks something like this:
import urllib2
import BeautifulSoup
from django.shortcuts import render_to_response

def index(request):
    source = urllib2.urlopen(EXTERNAL_WEBSITE_URL)
    bs = BeautifulSoup.BeautifulSoup(source.read())
    finalList = []  # do whatever with bs to populate the list
    return render_to_response('someTemplate.html', {'finalList': finalList})
First of all, is this an acceptable use?
Obviously, this is not good performance-wise. The external website page is pretty big, and I am only extracting a small part of it. I thought of two solutions:
Do all of this asynchronously. Load the rest of the page, populate with data once I get it. But I don't even know where to start. I'm just starting with Django and never done anything async up until now.
I don't care if this data is updated every 2-3 minutes, so caching is a good solution as well (also saves me the extra round-trips). How would I go about caching this data?
First, don't optimize prematurely. Get this to work.
Then, add enough logging to see what the performance problems (if any) really are.
You may find that the end-user's PC is the slowest part; getting data from another site may, actually, be remarkably fast when you do not fetch .JS libraries and .CSS and artwork and then render the entire thing in a browser.
Once you're absolutely sure that the fetch of the remote content really IS a problem. Really. Then you have to do the following.
Write a "crontab" script that does the remote fetch form time to time.
Design a place to cache the remote results. Database or file system, pick one.
Update your Django app to get the data from the cache (database or filesystem) instead of the remote URL.
Only after you have absolute proof that the urllib2 read of the remote site is the bottleneck.
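A sketch of such a crontab script, using a plain file as the cache (the URL, the cache path and the schedule are placeholders):

# fetch_remote.py -- run from cron, e.g. */5 * * * * python fetch_remote.py
import urllib2

EXTERNAL_WEBSITE_URL = 'http://example.com/page-to-scrape'  # placeholder
CACHE_PATH = '/var/cache/myapp/remote_page.html'            # placeholder

def main():
    data = urllib2.urlopen(EXTERNAL_WEBSITE_URL).read()
    with open(CACHE_PATH, 'w') as f:
        f.write(data)

if __name__ == '__main__':
    main()

The Django view then reads CACHE_PATH (or whatever database row the script writes) instead of hitting the remote URL on every request.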
Caching with Django is pretty easy:
from django.core.cache import cache

key = 'some-key'
data = cache.get(key)
if data is None:
    # soupify the page and what not
    cache.set(key, data, 60*60*8)  # note the order: key, then value, then timeout in seconds
    return render_to_response ...
return render_to_response
To answer your questions: you can do this asynchronously, but then you would have to use something like django cron to update the cache every so often. On the other hand, you can write this as a standalone Python script, replace the cache imported from Django with memcache, and it would work the same way. It would reduce some of the performance issues your site could have, and as long as you know the cache key, you can retrieve the data from the cache.
Like Jarret said, I would read Django's caching docs and memcache's docs for more information.
Django has robust, built-in support for caching views: http://docs.djangoproject.com/en/dev/topics/cache/#topics-cache.
It offers solutions for caching entire views (such as in your case), or just certain parts of data in the view. There are even controls for how often to update the cache, and so forth.
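For the whole-view case here, the per-view decorator is the least code; a sketch with a three-minute timeout to match the 2-3 minute freshness mentioned in the question (the URL is a placeholder):

import urllib2
from BeautifulSoup import BeautifulSoup
from django.shortcuts import render_to_response
from django.views.decorators.cache import cache_page

EXTERNAL_WEBSITE_URL = 'http://example.com/data'  # placeholder

@cache_page(60 * 3)  # cache the rendered response for three minutes
def index(request):
    source = urllib2.urlopen(EXTERNAL_WEBSITE_URL)
    soup = BeautifulSoup(source.read())
    finalList = []  # populate from soup as before
    return render_to_response('someTemplate.html', {'finalList': finalList})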
