I have a website I want to monitor and scrape content from every day. I know the site is updated manually at a certain time, and I've set my cron schedule to match, but since the update is manual it can come 10 or even 20 minutes late.
Right now I have a hack-ish cron job that runs every 5 minutes, but I'd like to use the deferred library to do this more precisely. I'm trying to chain deferred tasks so I can check whether there was an update, defer the check for a couple of minutes if there was none, and keep deferring until there finally is an update.
I have some code I thought would work, but it only ever defers once, when instead I need to continue deferring until there is an update:
(I am using Python)
from google.appengine.ext import deferred

class Ripper(object):
    def rip(self):
        if siteHasNotBeenUpdated:
            # no update yet: schedule another check in two minutes
            deferred.defer(self.rip, _countdown=120)
        else:
            updateMySite()
This was just a simplified excerpt, obviously.
I thought this was simple enough to work, but maybe I've just got it all wrong?
The example you give should work just fine. You need to add logging to determine if deferred.defer is being called when you think it is. More information would help, too: How is siteHasNotBeenUpdated set?
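For instance, a minimal sketch of your loop with logging added (keeping your placeholder names siteHasNotBeenUpdated and updateMySite):

import logging
from google.appengine.ext import deferred

class Ripper(object):
    def rip(self):
        if siteHasNotBeenUpdated:
            logging.info('No update yet; deferring another check for 120s')
            deferred.defer(self.rip, _countdown=120)
        else:
            logging.info('Update found; updating site')
            updateMySite()

If the first log line only ever appears once, the problem is in how siteHasNotBeenUpdated is evaluated rather than in the deferring itself.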
Currently, I'm using Google's two-step method to back up the datastore and then import it into BigQuery.
I also reviewed the code using pipeline.
Neither method is efficient, and both have a high cost, since all the data is imported every time.
I only need to add the records created since the last import.
What is the right way of doing it?
Is there a working example on how to do it in python?
You can look at Streaming inserts. I'm actually looking at doing the same thing in Java at the moment.
If you want to do it every hour, you could maybe add your inserts to a pull queue (either as serialised entities or keys/IDs) each time you put a new entity to Datastore. You could then process the queue hourly with a cron job.
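A rough sketch of that pull-queue approach, assuming a pull queue named bigquery-inserts defined in queue.yaml (the helper names here are made up):

import json
from google.appengine.api import taskqueue

def enqueue_for_export(entity):
    # call this whenever a new entity is put to the datastore
    taskqueue.Queue('bigquery-inserts').add(
        taskqueue.Task(payload=json.dumps({'key': str(entity.key())}),
                       method='PULL'))

def hourly_export():
    # run this from an hourly cron handler
    queue = taskqueue.Queue('bigquery-inserts')
    tasks = queue.lease_tasks(lease_seconds=300, max_tasks=100)
    rows = [json.loads(t.payload) for t in tasks]
    # ... stream `rows` to BigQuery (e.g. via tabledata.insertAll) ...
    queue.delete_tasks(tasks)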
There is no full working example (as far as I know), but I believe the following process could help you:
1. Add a "last time changed" timestamp to your entities, and keep it updated.
2. Every hour, run a MapReduce job whose mapper filters on that timestamp and only picks up the entities updated in the last hour.
3. Manually add what needs to be added to your backup.
As I said, this is pretty high level, but the actual answer will require a bunch of code. I don't think it is suited to Stack Overflow's format, honestly.
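Still, to illustrate steps 1 and 2, a sketch with a made-up model name (the actual MapReduce wiring is left out):

import datetime
from google.appengine.ext import db

class Record(db.Model):
    payload = db.TextProperty()
    # step 1: "last time changed", refreshed automatically on every put()
    updated = db.DateTimeProperty(auto_now=True)

def changed_in_last_hour():
    # step 2: the filter your mapper (or a plain query) would apply
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
    return Record.all().filter('updated >', cutoff)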
I would like to recreate some data in my project every 30 minutes (prices that change). I also have another job that needs to refresh every minute.
Now I've heard I should use a daemon, but I'm not sure how that works.
Can someone point me in the right direction?
Also, should I make an extra model to save that temporary data, or is that part of the daemon?
PS: I'm not sure if Stack Overflow can be used for this sort of question, but I don't know where else to search for this sort of information.
You don't want a daemon. You just want cron jobs.
The best thing to do is to write your scripts as custom Django management commands and use cron to trigger them to run at the specified intervals.
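A minimal sketch of such a command, with made-up app and command names:

# myapp/management/commands/refresh_prices.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Recreates the price data'

    def handle(self, *args, **options):
        # ... fetch and store the new prices here ...
        self.stdout.write('Prices refreshed\n')

A crontab entry along the lines of */30 * * * * python /path/to/manage.py refresh_prices (paths assumed) would then run it every 30 minutes, and a second entry with * * * * * would cover the per-minute job.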
I'm trying to optimize my code, and I've run into a problem I don't quite understand. On every page of my web app there will be a list of notifications, much like Facebook's new ticker. So, on every request, I run this code at the beginning:
notification_query = db.Query(Ticker, keys_only=True)\
    .filter('friends =', self.current_user.key().name())
self._notifications_future = notification_query.run()  # starts the query asynchronously
Then, where I find a good spot, I call the middle function, which is:
notification_keys = [future.parent() for future in self._notifications_future]
self._notifications = db.get_async(notification_keys)  # async batch get of the parent entities
Finally, I fetch them all at the end:
context.update({'notifications': self._notifications.get_result() })
Everything works great except for this: if I call the middle function at the end of the request handler, I get this:
[Appstats screenshot: "dumb spot"]
And if I call it in what I think is an optimized spot, I get this:
[Appstats screenshot: "smart spot"]
As you can see, the API usage is doubled by making this "optimization". What is going on here?
Call number 2 is the first snippet in both cases. Call number 12 in the dumb spot is the second snippet, and call number 12 in the smart spot is also the second snippet. This last switch has nothing to do with the problem; I've tested it.
PS: Does Google charge me for the time the query is "idle"?
UPDATE
The problem seems to be just in the dev_server; when I tried the same example (the smart version) on appspot, I got this:
[Appstats screenshot: queries on appspot]
Here everything works as expected: things called with run() or get_async() don't block other work. As I said, the issue is only in the dev_server. Still, it would be nice to see this working on localhost, for more effective profiling.
In addition to Peter's note, you should bear in mind that Appstats has no way to know when an asynchronous request actually completes, except when you fetch the result. As a result, even fast calls will look slow if you take a long time to request the result.
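Schematically, with the db calls from the question (render_most_of_the_page is just a placeholder):

rpc = db.get_async(notification_keys)  # the RPC is issued here
render_most_of_the_page()              # it may actually complete during this work...
notifications = rpc.get_result()       # ...but Appstats only records it as ending here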
Ahhhh. When posting about App Engine, it is very helpful to mention whether your results are from the dev server or from production. The dev server does not have the same performance characteristics as the production servers, so it is not the best way to profile your application. I believe indexes are not used at all, and the dev server is single-threaded, so it can't handle concurrent requests; in fact, if your app makes calls to itself, it won't work at all!
I would like to write a tiny calendar-like application for someone as a birthday present (to be run on Ubuntu). All it should do is display a separate picture each day, so whenever it's invoked it should check the date and select the appropriate picture from the collection I would provide, but also, in case it just keeps running, it should switch to the next picture when the next day begins.
The date-checking on invocation isn't the problem; my question pertains to the second case: how can I have the program notice the beginning of the next day? My clumsy approach would be to make it check the current date at regular intervals and let it change the displayed picture once there was a change in date, but that strikes me as very roundabout and not particularly elegant.
If any of you have an idea of how I could accomplish this, please don't hesitate to reply. I would aim to write the application in either Perl or Python, so suggestions concerning those two languages would be most welcome, but any other suggestions would be appreciated as well.
Thanks a lot for your time!
The answer to this could be very system-dependent, since controlling when your program executes varies by platform. On all *nix-type systems I would use cron. Assuming for a moment that you are using a *nix system, the answer then depends on what the program actually does.
If it only needs to select an image, then I would suggest that it not run continuously, but terminate itself after selecting the picture and be run again the next day (there are a lot of tutorials on how to set up cron).
If, however, it has some form of UI and it is likely (read: possible) to keep running for several days, then you can follow two approaches:
Create your program as it is, polling periodically for the current time and doing a date-delta comparison. Python's timedelta objects could help here. This is pretty much your inelegant approach.
The other solution would be to send it a signal from cron when you wish it to update. This means you would have to make it signal-aware and respond to something like USR1; the Python docs cover this, and you can find many tutorials on the web. This approach also works quite nicely for daemonised apps (a minimal sketch follows below).
I'm sure there are many other approaches too, but those are the ones that come to mind for a quickish and nastyish app.
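For the signal route, a minimal sketch in Python (the picture-switching logic is a placeholder):

import signal
import time

def switch_picture(signum, frame):
    # placeholder: reload and display today's image here
    print("Switching to today's picture")

signal.signal(signal.SIGUSR1, switch_picture)

while True:          # stand-in for the app's main/UI loop
    time.sleep(60)

Cron can then deliver the signal just after midnight, e.g. kill -USR1 $(cat /path/to/app.pid), assuming the app writes its PID to a file on startup.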
Did you think about scheduling the invocation of your script?
For me, the best approach is this:
1. Have two options to run the script:
run_script
run_script --update
2. Schedule the update run in a task scheduler (for example cron) to be executed daily.
3. When you want to check the image for the current day, simply run the script without the --update option.
If you would like me to extend any part of this, simply ask.
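A sketch of what that two-mode script could look like (file names and the copy step are just assumptions):

import argparse
import datetime
import shutil

def update():
    # pick the picture for today's date and make it the current one
    today = datetime.date.today().isoformat()
    shutil.copy('pictures/%s.jpg' % today, 'current.jpg')

def show():
    print('Displaying current.jpg')   # stand-in for the real display code

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--update', action='store_true')
    args = parser.parse_args()
    if args.update:
        update()
    else:
        show()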
I am using Django and am building some long-running processes that I interact with only through my web interface. They would run all the time, checking a database value every few minutes and stopping only when a boolean flag changes. I want to use Django to interact with them, but I'm unsure how to go about it. When I used PHP I had a method for doing this; I figured it would be even easier in Python, but my searches haven't turned anything up.
Basically, all I want is to execute Python code without waiting for it to finish, so it just begins executing and Django goes on to do whatever else it needs, quickly returning a new page to the user.
I know that there are ways to call an external program, so I suppose that may be the only way to go? Is there a way to do this by just calling other Python code?
Thanks for any advice.
Can't vouch for it because I haven't used it yet, but Celery does pretty much what you're asking for, and it was originally built specifically for Django.
http://celeryproject.org/
Their example showing a simple task adding two numbers:
from celery.decorators import task

@task
def add(x, y):
    return x + y
You can execute the task in the background, or wait for it to finish:
>>> result = add.delay(8, 8)
>>> result.wait() # wait for and return the result
16
You'll probably need to install RabbitMQ as well to get it working, so it might be a more complicated solution than you're looking for, but it will achieve your goals.
You want an asynchronous message manager. I've got a tutorial on integrating Gearman with Django. Any pickleable Python object can be sent to Gearman, which will do all the work and post the results wherever you want; the tutorial includes examples of posting back to the Django database (it also shows how to use the ORM outside of Django).
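For a feel of the client side, a sketch using the python-gearman library (the server address and task name are assumptions):

import pickle
import gearman

client = gearman.GearmanClient(['localhost:4730'])

# submit a pickled job and return immediately instead of waiting for the result
client.submit_job('check_database_flag',
                  pickle.dumps({'record_id': 42}),
                  background=True)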