I have a Django app that is basically a web front end on a database.
Every now and then, users upload files containing perhaps 1000s of records. These records will need to be parsed out of the file, processed, and used to create new records or update existing records in the database. I'm just wondering what is the better approach for processing the uploaded file:
in the view (while the user waits - I guess this could be up to 5 minutes)?
save the uploaded file and have some background cron job call a custom admin command to process it? This seems the most sensible to me.
or perhaps another method I haven't thought of?
Celery seems to be pretty hot these days too; you should definitely look into it:
https://github.com/ask/django-celery
http://celeryproject.org/
Send an email when done, or have the front end poll for results every X seconds after submission. "Are we there yet?" "Are we there yet?"
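For illustration, here's a minimal sketch of what that could look like with Celery; the model name, the unique key, and the CSV columns are placeholders for whatever your uploads actually contain:

```python
# tasks.py -- illustrative sketch; Record, external_id and the CSV columns
# are hypothetical placeholders.
import csv

from celery import shared_task

from myapp.models import Record  # hypothetical model


@shared_task
def process_upload(path):
    """Parse a saved upload and create/update records outside the request cycle."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            Record.objects.update_or_create(
                external_id=row["id"],           # assumed unique key column
                defaults={"name": row["name"]},  # assumed data column
            )
```

The view just saves the file somewhere durable, calls process_upload.delay(saved_path), and returns immediately; the front end can then poll a status endpoint, or you can email the user when the task finishes.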
I'd also like to know a simple, safe way to start a thread that writes to the DB.
I'm trying to add real-time features to my Django web app. Basically, I want to show real-time data on a web page.
I have an external Python script which generates some JSON data - nothing big, but around 10 records per second. On the other side, I have a Django app, and I would like it to receive that data and show it on an HTML page in real time. I've already considered writing the data to a DB and then retrieving it from Django, but that would mean too many queries, since Django would query the DB at least once per second for every user while my external script would be writing a lot of data every second.
What I'm missing is a "central" system, a way to make these two pieces communicate. I know the question is probably not specific enough, but is there some way to do this? I know a bit about Django Channels, but I don't know if I could do what I want with it; I've also considered pushing the data onto a RabbitMQ queue and then retrieving it from Django, but that doesn't seem like the best use of RabbitMQ.
So is there a way to do this with Django Channels? Any kind of advice is appreciated.
I would suggest using Django Channels. You can also use Redis instead of RabbitMQ. In your case, Redis might be a better choice.
Here is an approach: http://www.maxburstein.com/blog/realtime-django-using-nodejs-and-socketio/
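For a rough idea of how Channels could act as that "central" piece, here is a minimal sketch assuming channels and channels_redis are installed with a Redis channel layer configured; the group name and payload shape are made up:

```python
# consumers.py -- illustrative sketch; "live_data" and the payload format
# are hypothetical.
import json

from asgiref.sync import async_to_sync
from channels.generic.websocket import WebsocketConsumer


class LiveDataConsumer(WebsocketConsumer):
    group_name = "live_data"

    def connect(self):
        async_to_sync(self.channel_layer.group_add)(self.group_name, self.channel_name)
        self.accept()

    def disconnect(self, close_code):
        async_to_sync(self.channel_layer.group_discard)(self.group_name, self.channel_name)

    def push_data(self, event):
        # Handles group messages of type "push.data" and forwards the payload
        # to the connected browser over the WebSocket.
        self.send(text_data=json.dumps(event["payload"]))
```

Your external script (with DJANGO_SETTINGS_MODULE pointing at the same settings) could then publish each record with async_to_sync(get_channel_layer().group_send)("live_data", {"type": "push.data", "payload": record}), and the page's JavaScript simply listens on the WebSocket - no per-user database polling.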
background
I'm working on a project which is made up of Django and Celery.
I wrote two tasks which spider two different websites and save some data to the database (MySQL).
Before, I had just one task, and using update_or_create was enough.
But now I want to run more tasks in different workers, all of which may save data to the database - and what's more, they may try to save the same data at the same time.
question
How can I ensure that tasks running in different workers don't create duplicate data when all of them try to save the same data?
I know Django has an API, select_for_update, which sets a lock in the database. But from reading the documentation, it seems to mean: select, and if the row exists, update it. What I want is: select, and if the row exists update it, else create it. That would be more like update_or_create, but that API may not use a lock?
an apology
The users who answered my question earlier may well have given me the right answer, but I didn't understand what they meant.
what I chose
In the end I used a Redis lock to ensure there is no duplicate data.
The logic is below (a code sketch follows the list):
When I get the data, I try set(key, value, nx=True, ex=60) to acquire a lock from Redis.
If the answer is True, I use the Django query API update_or_create().
If not, I do nothing and return True.
This makes the burst of concurrent saves behave like a single process.
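A rough sketch of that pattern, assuming redis-py and a hypothetical Article model; the key naming, the 60-second expiry, and the fields are illustrative:

```python
# Illustrative sketch of the lock-then-update_or_create pattern.
import redis

from myapp.models import Article  # hypothetical model the spiders fill

r = redis.Redis()


def save_item(unique_key, fields):
    # NX: only set the key if it does not already exist.
    # EX: expire after 60 s so a crashed worker cannot hold the lock forever.
    got_lock = r.set("lock:article:%s" % unique_key, "1", nx=True, ex=60)
    if not got_lock:
        # Another worker is already saving this item; do nothing.
        return True
    Article.objects.update_or_create(
        slug=unique_key,   # assumed unique field
        defaults=fields,   # assumed dict of fields to update
    )
    return True
```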
I'm looking for a way to constantly check my database (MySQL) for new entries. Once a new entry is committed I want to output it in a webpage using Flask.
Since the process takes time to finish, I would like to give the users the impression that it took only a few seconds to retrieve the data.
For now I'm waiting for the whole process to finish before giving the user the whole result. But I would prefer to update the results web page every time a new entry is added to the DB. So, for example, the first entry is added to the DB and the user immediately sees it on the web page; then a second entry is added and the user now sees both the first and second entries, and so on. I don't know whether this has to come from Flask or some other way.
Any idea?
You can set MySQL to log all commits to the General Query Log and monitor the changes (for example via Watchdog or PyNotify). Once the file changes, you can parse the new log entries and get your signal. This way you avoid polling for changes.
The better way would, of course, be to send the signal at the moment you store the data in the database.
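A rough sketch of the log-watching side with Watchdog; the log path and the parsing are placeholders, and how you hand the result to Flask (a queue read by a server-sent-events endpoint, a WebSocket, etc.) is up to you:

```python
# Illustrative sketch: watch MySQL's general query log and react to new writes.
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

LOG_PATH = "/var/log/mysql/general.log"  # assumed location of the general query log


class QueryLogHandler(FileSystemEventHandler):
    def __init__(self):
        super().__init__()
        self._position = 0  # for a real setup, start at the current end of file

    def on_modified(self, event):
        if event.src_path != LOG_PATH:
            return
        with open(LOG_PATH) as f:
            f.seek(self._position)
            new_lines = f.readlines()
            self._position = f.tell()
        for line in new_lines:
            if "INSERT" in line:
                # A new row was committed: notify the web layer here instead
                # of having it poll the database.
                print("new entry:", line.strip())


observer = Observer()
observer.schedule(QueryLogHandler(), path="/var/log/mysql", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```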
I want to load info from another site (this part is done), but I am doing this every time the page is loaded and that won't do. So I was thinking of having a variable in a table of settings, like 'last checked bbc site', and when the page loads it would check whether it's been long enough since the last check to check again. Is there anything silly about doing it that way?
Also, do I absolutely have to use tables to store one-off variables like this setting?
I think there are 2 options that would work for you, besides creating an entity in the datastore to keep track of "last visited time".
One way is to just check the external page periodically, using the cron api as described by jldupont.
The second way is to store the last visited time in memcache. Although memcache is not permanent, it doesn't have to be if you are only storing last refresh times. If your entry in memcache were to disappear for some reason, the worst that would happen would be that you would fetch the page again, and update memcache with the current date/time.
The first way would be best if you want to check the external page at regular intervals. The second way might be better if you want to check the external page only when a user clicks on your page, and you haven't fetched that page yourself in the recent past. With this method, you aren't wasting resources fetching the external page unless someone is actually looking for data related to it.
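A minimal sketch of the memcache approach on the old Python runtime; the key name, the 10-minute interval, and fetch_external_page() are placeholders for your existing code:

```python
# Illustrative sketch (GAE Python 2.7 runtime).
import time

from google.appengine.api import memcache

LAST_CHECK_KEY = 'last_checked_bbc_site'  # hypothetical key name
CHECK_INTERVAL = 600                      # seconds between refreshes


def maybe_refresh():
    last = memcache.get(LAST_CHECK_KEY)
    if last is not None and time.time() - last < CHECK_INTERVAL:
        return  # checked recently enough; nothing to do
    fetch_external_page()  # your existing fetch-and-store code
    # If this entry gets evicted, the worst case is one extra fetch.
    memcache.set(LAST_CHECK_KEY, time.time())
```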
You could also use Scheduled Tasks.
Also, you don't absolutely need to use the Datastore for configuration parameters: you could have this in a script / config file.
If you want some handler on your GAE app (including one for a scheduled task, reception of messages, web page visits, etc.) to store some new information in such a way that some handler in the future can recover that information, then GAE's storage is the only good general way (memcache could expire from under you, for example). I'm not sure what you mean by "tables", but guessing that you actually mean GAE's storage, the answer is "yes". (Under very specific circumstances you might want to put that data somewhere else on the network, such as your visitor's browser, e.g. via cookies, or an Amazon storage instance, etc., but it doesn't appear to me that those specific circumstances are applicable to your use case.)
I have a file that contains ~16,000 lines of information on entities. The user is supposed to upload the file using an HTML upload form, then the system handles this by reading line by line and creating then put()'ing entities onto the datastore.
I'm limited by the 30 second request time limit. I have tried a lot of different work-arounds using Task Queue, forced HTML redirecting, etc. and nothing has worked for me.
I am using forced HTML redirecting to delete all data and this works, albeit VERY slowly. (4th answer here: Delete all data for a kind in Google App Engine)
I can't seem to apply this to my uploading problem, since my method has to be a POST method. Is there a solution somehow? Sample code would be much appreciated since I'm very new to web development in general.
To solve a similar problem, I stored the dataset in a model with a single TextProperty, then spawned a taskqueue task (sketched in code after this list) that:
Fetches a dataset from the datastore if there are any left.
Checks if the length of the dataset is <= N, where N is some small number of entities you can put() without a timeout. I used 5. If so, writes the individual entities, deletes the dataset record, and spawns a new copy of the task.
If the dataset is bigger than N, splits it into N parts in the same format, writes those to the datastore, deletes the original entity, and spawns a new copy of the task.
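A rough sketch of that task on the old Python runtime, with a hypothetical Dataset model holding the raw upload text, a hypothetical Entity model for the real records, and a made-up task URL:

```python
# Illustrative sketch (GAE Python 2.7 runtime); Dataset, Entity and the
# line format are placeholders for whatever your upload actually contains.
from google.appengine.api import taskqueue
from google.appengine.ext import db

N = 5  # small enough to put() inside one request without timing out


class Dataset(db.Model):
    lines = db.TextProperty()


class Entity(db.Model):
    value = db.StringProperty()


def process_one_dataset():
    dataset = Dataset.all().get()
    if dataset is None:
        return  # nothing left to do; the chain of tasks stops here
    lines = dataset.lines.splitlines()
    if len(lines) <= N:
        # Small enough: write the real entities.
        db.put([Entity(value=line) for line in lines])
    else:
        # Too big: split into at most N roughly equal chunks, stored in the
        # same Dataset format, to be handled by later copies of the task.
        size = len(lines) // N + 1
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        db.put([Dataset(lines='\n'.join(chunk)) for chunk in chunks])
    dataset.delete()
    # Re-enqueue: the handler mapped to this URL calls process_one_dataset().
    taskqueue.add(url='/tasks/process_dataset')
```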
If you're doing this to bulk load data, why not use the bulk loader?
If you need the interface to be accessible to non-admin users, then, as suggested, you need to break the file up into decent-sized chunks (by taking blocks of n lines each), put them into the datastore, and start a task to deal with each of them.