How can I refresh my Python file every five minutes while Django is running? The data I'm web scraping changes every hour, and I need to update the value of the variable.
What you need is a task that runs periodically, and a cron job solves that. I recommend you take a look at django-cron or Celery; both are excellent options for creating scheduled tasks.
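For example, a minimal Celery beat sketch ("myproject" and refresh_data are placeholder names, and the Redis broker URL is an assumption):

# myproject/celery.py -- minimal sketch; adjust names and broker to your project
from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")

@app.task
def refresh_data():
    # re-run your scraping code here and store the result somewhere persistent
    ...

# have Celery beat trigger the task every five minutes
app.conf.beat_schedule = {
    "refresh-every-five-minutes": {
        "task": "myproject.celery.refresh_data",
        "schedule": 300.0,  # seconds
    },
}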
I recommend using a database such as sqlite3; that is a better solution than restarting Django (the web service) every hour.
You can store the scraped data in the database, and Django can read it back much like using variables.
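For example, a tiny sketch (ScrapedValue and "price" are placeholder names):

# models.py -- hypothetical model holding the scraped value
from django.db import models

class ScrapedValue(models.Model):
    name = models.CharField(max_length=100, unique=True)
    value = models.TextField()
    updated_at = models.DateTimeField(auto_now=True)

# wherever the scraper runs, save the fresh value
ScrapedValue.objects.update_or_create(name="price", defaults={"value": scraped_value})

# in a view, read it back as if it were a variable
current = ScrapedValue.objects.get(name="price").value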
The real problem here is that you're fetching the data at the start of the application and keeping it in memory. I see 2 possible better methods:
move the data scraping code to your view function. This means you'll re-scrape on every call ensuring you'll always have the freshest data but at the cost of speed (the time it takes to make the request to your target url).
better yet: same as above, except you cache the results locally. This could also be kept in memory (although I'd use a file or database if you're running multiple django app instances, to ensure they're all using the same data). With the least amount of change to what you have already, an in-memory cache can be achieved simply by adding a timestamp variable that records the time of each fetch. If the last fetch was more than X minutes ago, re-fetch your data.
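A minimal sketch of that in-memory, per-process cache, assuming a scrape() function that does the actual fetching:

import time

CACHE_TTL = 300  # seconds; refresh at most every five minutes
_cached_data = None
_last_fetch = 0.0

def get_data():
    """Return the scraped data, re-fetching only if the cache is stale."""
    global _cached_data, _last_fetch
    if _cached_data is None or time.time() - _last_fetch > CACHE_TTL:
        _cached_data = scrape()  # your existing scraping code
        _last_fetch = time.time()
    return _cached_data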
I've been trying to find a solution to my issue for some time now, but I haven't come across anything intuitive enough to feel like the "right" solution.
I'm building an electron app that uses django as the backend. The backend is responsible for running some long processes that are time critical. For example, I have a loop that continuously takes data for about 5 seconds. When I run that function standalone, it takes a data point about every 10 ms; however, when I run it through django, it takes a data point anywhere from 10 ms to 70 ms or even longer. This makes sense to me intuitively, because django is sharing thread time to keep responding to the frontend. However, this is unacceptable for my application.
The delays seem to be related to returning data to the frontend. Basically, there's a static variable in the view class that's a container for the result data; the measurement is triggered from the front end and populates that static variable with data. While the measurement is running, the front end queries django once a second for updated data so it can plot it for the user to track progress.
I first tried using threading.Thread to create a new thread to run the measurement. This thread gets triggered by the django view, but that doesn't fix the issue. Ok, maybe this makes sense too because the thread is still sharing processing time with the main thread?
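Roughly, the setup looks like this (simplified sketch; read_data_point() stands in for the real acquisition call):

import threading
import time
from django.http import JsonResponse

# module-level container the front end polls; names are simplified
RESULTS = {"points": [], "done": False}

def run_measurement():
    """Long-running acquisition loop that fills the shared container."""
    RESULTS["points"].clear()
    RESULTS["done"] = False
    for _ in range(500):                              # ~5 s of samples at ~10 ms each
        RESULTS["points"].append(read_data_point())   # the real acquisition call
        time.sleep(0.01)
    RESULTS["done"] = True

def start_measurement(request):
    threading.Thread(target=run_measurement, daemon=True).start()
    return JsonResponse({"started": True})

def poll_results(request):
    return JsonResponse(RESULTS)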
The next step seems to be creating an entirely new subprocess. However, I'd still like to be passing data back to the front end while the script runs, so short of dropping a file, I don't know how to do that.
Is there an officially supported way to run a function through django whose execution time won't be impacted by the fact that it's being triggered from django?
I am using a MySQL database via Python for storing logs.
I was wondering if there is any efficient way to remove the oldest rows once the number of rows exceeds a limit.
I was able to do this by running a query to count the total rows, then sorting in ascending order and deleting the oldest ones. But this method is taking too much time. Is there a way to make this efficient by setting up a rule when creating the table, so that MySQL itself takes care of it when the limit is exceeded?
Thanks in advance.
Well, there's no simple and built-in way to do this in MySQL.
Solutions that use triggers to delete old rows when you insert a new row are risky, because the trigger might fail. Or the transaction that spawned the trigger might be rolled back. In either of these cases, your intended deletion will not happen.
Also putting the burden of deleting on the thread that inserts new data causes extra work for the insert request, and usually we'd prefer not to make things slower for our current users.
It's more common to run an asynchronous job periodically to delete older data. This can be scheduled to run at off-hours, and run in batches. It also gives more flexibility to archive old data, or execute retries if the deletion or archiving fails or is interrupted.
MySQL does support an EVENT system, so you can run a stored routine based on a schedule. But you can only do tasks you can do in a stored routine, and it's not easy to make it do retries, or archive to any external system (e.g. cloud archive), or notify you when it's done.
Sorry there is no simple solution. There are just too many variations on how people would like it to work, and too many edge cases of potential failure.
The way I'd implement this is to use cron or else a timer thread in my web service to check the database, say once per hour. If it finds the number of rows is greater than the limit, it deletes the oldest rows in modestly sized batches (e.g. 1000 rows at a time) until the count is under the threshold.
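A rough sketch of that job, assuming a logs table with an auto-increment id column and the mysql.connector driver (any DB-API driver works the same way):

import mysql.connector

MAX_ROWS = 1_000_000   # retention limit; placeholder value
BATCH_SIZE = 1000      # delete in modest batches to keep each transaction short

conn = mysql.connector.connect(host="localhost", user="app",
                               password="...", database="logs_db")
cur = conn.cursor()

cur.execute("SELECT COUNT(*) FROM logs")
(count,) = cur.fetchone()

# delete the oldest rows in batches until the count is back under the threshold
while count > MAX_ROWS:
    cur.execute("DELETE FROM logs ORDER BY id ASC LIMIT %s",
                (min(BATCH_SIZE, count - MAX_ROWS),))
    conn.commit()
    count -= cur.rowcount

cur.close()
conn.close()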
I like to write scheduled jobs in a way that can be easily controlled and monitored. So I can make it run immediately if I want, and I can disable or resume the schedule if I want, and I can view a progress report about how much it deleted the last time it ran, and how long until the next time it runs, etc.
I have a problem with my Dash application deployed on a remote office server. Two users running the app interfere with each other, because a table import is followed by table pricing (the pricing code is around 10,000 lines and produces 8 tables). While looking on the internet, I saw that this problem could supposedly be solved with a hidden html.Div, after converting the dataframes to JSON. However, that solution is not possible for me because I have to store 9 tables totaling 200,000 rows and 500 columns. So I looked into the cache solution. That option does not produce errors, but it increases the execution time of the program considerably: going from a table of 20,000 vehicles to 200,000, the compute time increases by almost a factor of 1,000, and it is horrible every time I change the settings of the graphics.
I use a filesystem cache, following example 4 from this page: https://dash.plotly.com/sharing-data-between-callbacks. By doing some timing, I noticed that the problem is not accessing the cache (about 1 second) but converting the JSON tables back into dataframes (almost 60 seconds per callback). Sixty seconds is also roughly what the pricing itself takes, so calling the cache in a callback costs about as much as re-running the pricing in the callback.
1/ Do you have an idea for caching a dataframe directly rather than JSON, whether with something like the invisible html.Div, a cookie system, or any other method?
2/ With Redis or Memcached, do we have to return JSON?
3/ If so, how do we set it up, following example 4 from the previous link? I get the error "redis.exceptions.ConnectionError: Error 10061 connecting to localhost:6379. No connection could be made because the target machine actively refused it."
4/ Do you also know whether shutting down the application automatically deletes the cache, without waiting for the default_timeout?
I think your issue can be solved using dash_extensions, specifically its server-side callback caching; it might be worth a shot to implement.
https://community.plotly.com/t/show-and-tell-server-side-caching/42854
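Roughly, it could look like this (the exact names vary between dash-extensions versions; this sketch assumes the ServersideOutput interface from around the time of that post, with run_pricing() and make_figure() as placeholders):

from dash import dcc, html
from dash_extensions.enrich import (DashProxy, ServersideOutput,
                                    ServersideOutputTransform, Input, Output)

app = DashProxy(transforms=[ServersideOutputTransform()])
app.layout = html.Div([
    dcc.Dropdown(id="settings"),
    dcc.Loading(dcc.Store(id="pricing-store")),
    dcc.Graph(id="graph"),
])

@app.callback(ServersideOutput("pricing-store", "data"), Input("settings", "value"))
def compute_pricing(settings):
    # the returned DataFrame is stored server side (pickled), so no JSON round trip
    return run_pricing(settings)

@app.callback(Output("graph", "figure"), Input("pricing-store", "data"))
def update_graph(df):
    # df arrives here as a real DataFrame, not a JSON string
    return make_figure(df)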
I use a Python script (on a Linux web server) to redirect users on request. The redirection is based on a database (a Python dictionary), and the database itself is built from a remote CSV file.
For now, I have to update the database manually, but the CSV file can change at any time.
I'm looking for a way to update the database after each user request (about 10 seconds later). That way the database is always up to date and the user does not suffer from the update.
I'm trying with the sched module, but it doesn't work.
import sched, time
s = sched.scheduler(time.time, time.sleep)
s.enter(0, 1, app.redirect, ())            # schedule the redirect to run immediately
s.enter(10, 1, app.data_base_update, ())   # schedule the database update 10 seconds later
s.run()                                    # blocks until every scheduled event has run
The goal is to keep the URL redirection fast for the user and defer the update until later. Is there a good solution to do this with a single script file?
You would be better served by updating a COPY in the background, and instantly switching them and making the updated copy into the live copy. Thus there would be no wait for the user, and you could do so at any time. You are probably best not doing so 10sec after each user request (imagine a flood of requests... it will bring your server to its knees). You can schedule a cron script or other automated task to do so, every minute or half hour etc.; depending on the size of the task you can also limit the CPU utilization.
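A minimal sketch of that copy-and-swap, assuming a build_database_from_csv() helper that downloads and parses the remote CSV:

import threading
import time

database = build_database_from_csv()   # the live dictionary used for redirects

def refresh_loop(interval=60):
    """Rebuild a copy in the background and swap it in atomically."""
    global database
    while True:
        time.sleep(interval)
        try:
            new_copy = build_database_from_csv()   # slow part; users keep using the old dict
        except Exception:
            continue                               # on failure, keep serving the old copy
        database = new_copy                        # rebinding the name is atomic in CPython

threading.Thread(target=refresh_loop, daemon=True).start()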
Note that your solution still doesn't ensure the database is always up to date, since you are working with remote data. But that is unfortunately the price of working with remote data. Make sure not to hammer the remote servers if they do not belong to you. =)
I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: Add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks whether it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
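A rough sketch with the old Python db API, assuming entities are created with a key_name such as the user id (UserRank, UserRankSnapshot, and score are placeholder names):

import datetime
from google.appengine.ext import db

class UserRankSnapshot(db.Model):
    score = db.IntegerProperty()
    taken = db.DateTimeProperty()

class UserRank(db.Model):
    score = db.IntegerProperty()
    last_snapshot = db.DateTimeProperty()

    def put(self, **kwargs):
        now = datetime.datetime.utcnow()
        if self.last_snapshot is None or now - self.last_snapshot > datetime.timedelta(hours=1):
            # key name derived from the hour, so concurrent updates collapse into one snapshot
            key_name = "%s-%s" % (self.key().name(), now.strftime("%Y%m%d%H"))
            UserRankSnapshot(key_name=key_name, score=self.score, taken=now).put()
            self.last_snapshot = now
        return db.Model.put(self, **kwargs)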
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article, the cron job calls your initial process, you catch the timeout error and call the process again, masked as a second URL. You have to ping-pong between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely. Make sure that there is an end state for your updating loop, since it would put you over your quotas pretty quickly if it never ended.
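Roughly, that ping-pong pattern might look like this with the old webapp API (/update_a, /update_b, and process_next_user() are placeholder names):

from google.appengine.ext import webapp
from google.appengine.runtime import DeadlineExceededError

class UpdateHandler(webapp.RequestHandler):
    """Cron hits /update_a; on timeout we bounce to /update_b, and vice versa."""
    def get(self):
        cursor = self.request.get("cursor") or None
        try:
            while True:
                cursor = process_next_user(cursor)   # your per-user update step
                if cursor is None:                   # end state: nothing left to do
                    self.response.out.write("done")
                    return
        except DeadlineExceededError:
            # bounce to the alternate URL so App Engine doesn't see a self-redirect loop
            next_url = "/update_b" if self.request.path == "/update_a" else "/update_a"
            self.redirect("%s?cursor=%s" % (next_url, cursor))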