I want to set a TTL of around 2-3 months, so having the TTL index checked every 60 seconds is clearly overkill. I want to reduce overhead by checking the TTL once a day. Is there any way to define this interval manually or programmatically?
As far as I know, this is impossible. Some time ago I was looking for this option myself but found nothing apart from disabling the TTL monitor completely.
I am inclined to think this cannot be modified, because the TTL documentation states explicitly that:
The background task that removes expired documents runs every 60 seconds.
and there is no server configuration parameter that does anything similar.
P.S. I understand that you see this as a waste of resources, but I would only start to worry about it once I actually saw a bottleneck caused by it.
P.P.S. And if you do find that this is a bottleneck, you can implement your own cleanup: write a script that removes all documents older than some timestamp, and run it once per day as a cron job.
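A minimal sketch of such a script using pymongo (the connection string, database/collection names, and the createdAt field are all assumptions to adapt):

from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client.mydb.mycollection  # hypothetical database and collection

# Remove everything older than ~90 days (the 2-3 months from the question).
cutoff = datetime.now(timezone.utc) - timedelta(days=90)
result = collection.delete_many({"createdAt": {"$lt": cutoff}})
print(f"Removed {result.deleted_count} expired documents")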
I am using MySQL database via python for storing logs.
I was wondering if there is any efficient way to remove the oldest row if the num of rows exceeds the limit.
I was able to do this by executing a query to count the total rows and then deleting the oldest ones by ordering them in ascending order. But this method takes too much time. Is there a way to make this efficient, for instance by defining a rule when creating the table, so that MySQL itself takes care of it when the limit is exceeded?
Thanks in advance.
Well, there's no simple and built-in way to do this in MySQL.
Solutions that use triggers to delete old rows when you insert a new row are risky, because the trigger might fail. Or the transaction that spawned the trigger might be rolled back. In either of these cases, your intended deletion will not happen.
Also, putting the burden of deleting on the thread that inserts new data adds extra work to each insert request, and usually we'd prefer not to make things slower for our current users.
It's more common to run an asynchronous job periodically to delete older data. This can be scheduled to run at off-hours, and run in batches. It also gives more flexibility to archive old data, or execute retries if the deletion or archiving fails or is interrupted.
MySQL does support an EVENT system, so you can run a stored routine on a schedule. But you can only do what a stored routine can do, and it's not easy to make it retry, archive to an external system (e.g. cloud storage), or notify you when it's done.
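For what it's worth, a rough sketch of such an event, created here from Python with mysql-connector-python (the table, column, and 90-day retention are hypothetical); note that the whole job has to fit into one SQL statement or stored routine:

import mysql.connector  # assumes the mysql-connector-python package

conn = mysql.connector.connect(user="app", database="mydb")  # illustrative connection
cur = conn.cursor()
# The server's event scheduler must be enabled: SET GLOBAL event_scheduler = ON;
cur.execute("""
    CREATE EVENT IF NOT EXISTS trim_logs_daily
    ON SCHEDULE EVERY 1 DAY
    DO
      DELETE FROM logs
      WHERE created_at < NOW() - INTERVAL 90 DAY
""")
conn.commit()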
Sorry there is no simple solution. There are just too many variations on how people would like it to work, and too many edge cases of potential failure.
The way I'd implement this is to use cron or else a timer thread in my web service to check the database, say once per hour. If it finds the number of rows is greater than the limit, it deletes the oldest rows in modestly sized batches (e.g. 1000 rows at a time) until the count is under the threshold.
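A minimal sketch of that loop, again with mysql-connector-python and a hypothetical logs table whose auto-increment id reflects insertion order:

import mysql.connector  # assumes the mysql-connector-python package

ROW_LIMIT = 1_000_000  # hypothetical cap on the logs table
BATCH_SIZE = 1000      # modest batches keep locks short

def trim_logs(conn):
    cur = conn.cursor()
    while True:
        cur.execute("SELECT COUNT(*) FROM logs")
        (count,) = cur.fetchone()
        if count <= ROW_LIMIT:
            break
        # Delete the oldest rows first, never more than one batch at a time.
        cur.execute(
            "DELETE FROM logs ORDER BY id ASC LIMIT %s",
            (min(BATCH_SIZE, count - ROW_LIMIT),),
        )
        conn.commit()
    cur.close()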
I like to write scheduled jobs in a way that can be easily controlled and monitored. So I can make it run immediately if I want, and I can disable or resume the schedule if I want, and I can view a progress report about how much it deleted the last time it ran, and how long until the next time it runs, etc.
Suppose I have a model Event. I want to send a notification (email, push, whatever) to all invited users once the event has elapsed. Something along the lines of:
class Event(models.Model):
    start = models.DateTimeField(...)
    end = models.DateTimeField(...)
    invited = models.ManyToManyField(User)

    def onEventElapsed(self):
        for user in self.invited.all():
            my_notification_backend.sendMessage(target=user, message="Event has elapsed")
Now, of course, the crucial part is to invoke onEventElapsed whenever timezone.now() >= event.end.
Keep in mind, end could be months away from the current date.
I have thought about two basic ways of doing this:
Use a periodic cron job (say, every five minutes or so) which checks if any events have elapsed within the last five minutes and executes my method.
Use celery and schedule onEventElapsed using the eta parameter to be run in the future (within the model's save method).
Considering option 1, a potential solution could be django-celery-beat. However, it seems a bit odd to run a task at a fixed interval just to send notifications. In addition, I came up with a (potential) issue that would (probably) result in a not-so-elegant solution:
Check every five minutes for events that have elapsed in the previous five minutes? That seems shaky; maybe some events are missed (or others get their notifications sent twice?). A potential workaround: add a boolean field to the model that is set to True once notifications have been sent.
Then again, option 2 also has its problems:
Manually take care of the situation when an event's start/end datetime is moved. When using celery, one would have to store the task ID (easy, of course), revoke the task once the dates have changed, and issue a new one. But I have read that celery has (design-specific) problems when dealing with tasks that run far in the future: Open Issue on github. I realize how this happens and why it is anything but trivial to solve.
Now, I have come across some libraries which could potentially solve my problem:
celery_longterm_scheduler (but does this mean I cannot use celery as I would have before, because of the different Scheduler class? This also ties into the possible usage of django-celery-beat... Using either of the two frameworks, is it still possible to queue jobs that are just a bit longer-running, but not months away?)
django-apscheduler, which uses apscheduler. However, I was unable to find any information on how it handles tasks scheduled for the far future.
Is there a fundamental flaw in the way I am approaching this? I'm glad for any input you might have.
Notice: I know this is likely to be somewhat opinion-based; however, maybe there is a very basic thing that I have missed, regardless of what some might consider ugly or elegant.
We're doing something like this in the company I work for, and the solution is quite simple.
Have a cron job / celery beat task that runs every hour to check whether any notifications need to be sent.
Then send those notifications and mark them as done. This way, even if your notification time is years ahead, it will still be sent. Using ETA is NOT the way to go for a very long wait time; your cache / AMQP broker might lose the data.
You can reduce your interval depending on your needs, but do make sure the runs don't overlap.
If one hour is too big a gap, then what you can do is run a scheduler task every hour. The logic would be something like:
run a task (let's call it the scheduler task) hourly, via celery beat, that gets all notifications that need to be sent in the next hour
schedule those notifications via apply_async(eta=...) - this will be the actual sending
Using that methodology would get you the best of both worlds (eta and beat).
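A rough sketch of that two-step pattern (everything here is illustrative: the Notification model with its send_at, scheduled, and done fields, and the send() helper are assumptions to adapt to your own schema):

from datetime import timedelta

from celery import shared_task
from django.utils import timezone

from myapp.models import Notification  # hypothetical app and model

@shared_task
def schedule_upcoming_notifications():
    # Runs hourly via celery beat: pick up everything due within the next hour.
    now = timezone.now()
    due_soon = Notification.objects.filter(
        send_at__lte=now + timedelta(hours=1),
        scheduled=False,
    )
    for notification in due_soon:
        # A short ETA is fine; only very long waits risk losing the task.
        send_notification.apply_async(args=[notification.pk], eta=notification.send_at)
        notification.scheduled = True
        notification.save(update_fields=["scheduled"])

@shared_task
def send_notification(notification_pk):
    notification = Notification.objects.get(pk=notification_pk)
    # Hand off to whatever backend you use, then mark the row as done.
    notification.send()
    notification.done = True
    notification.save(update_fields=["done"])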
I'm currently making a program that sends random text messages at randomly generated times during the day. I first wrote my program in Python and then realized that if I want other people to sign up to receive messages, I would have to use some sort of online framework. (If anyone knows a way to use my Python code without having to change it, that would be amazing, but for now I have been trying to use web2py.) I looked into the scheduler, but it does not seem to do what I have in mind. If anyone knows of a way to pass a time value into a function and have it run at that time, that would be great. Thanks!
Check out the APScheduler module for cron-like scheduling of events in Python - their examples show how to schedule some Python code to run in a cron-ish way.
Still not sure about the random part, though...
As for a web framework that may appeal to you (seeing as you are already familiar with Python), you should really look into Django (or, to keep things simple, just use WSGI).
Best.
I think you can actually use the Scheduler and Tasks of web2py. I've never used it ;) but the documentation describes creating a task to which you can pass parameters from your code - exactly what you need - and it should work fine for your purposes:
scheduler.queue_task('mytask', start_time=myrandomtime)
So you need web2py's cron, running every day and firing code similar to the above for each message to be sent (passing the parameters you need, probably the message content and phone number; see the examples in the web2py book). That gives you a daily creation of tasks, which are then processed by the scheduler.
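For instance, a sketch of that daily job (assuming a web2py Scheduler is already defined as scheduler and a send_sms task function is registered; both names are illustrative):

import random
from datetime import datetime, timedelta

def queue_tomorrows_messages(recipients):
    # `recipients` is a list of (phone_number, message) pairs.
    midnight = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
    tomorrow = midnight + timedelta(days=1)
    for phone_number, message in recipients:
        # Pick a random second within the next day for each message.
        send_at = tomorrow + timedelta(seconds=random.randrange(24 * 60 * 60))
        scheduler.queue_task(
            'send_sms',
            pvars={'phone_number': phone_number, 'message': message},
            start_time=send_at,
        )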
You could also go with a simpler solution: one daily cron job that prepares the queue of messages with random times for the next day, and a second one that runs every ten minutes or so, checks what is waiting to be processed, and sends the messages. So, no Tasks. This way is a bit ugly, though (consider a single run that takes more than 10 minutes). You may also want to keep and check statuses on the messages being processed (like pending, ongoing, done) to prevent two jobs from working on the same message and to allow tracking the progress of processing. Anyway, you could use the cron method in an early version of your software and replace it with a better method later :)
In any case, you should check the expected number of messages to process and the average processing time on your target platform, to make sure the chosen method is fast enough for your needs.
This is an old question, but in case someone is interested: the answer is APScheduler's blocking scheduler, with jobs set to run at regular intervals with some jitter.
See: https://apscheduler.readthedocs.io/en/3.x/modules/triggers/interval.html
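A small sketch of that setup (the hourly interval and the 20-minute jitter are arbitrary choices, and send_message is a placeholder):

from apscheduler.schedulers.blocking import BlockingScheduler

def send_message():
    pass  # replace with the actual message-sending logic

scheduler = BlockingScheduler()
# Fire roughly once an hour, shifted by up to 1200 seconds of random jitter.
scheduler.add_job(send_message, 'interval', hours=1, jitter=1200)
scheduler.start()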
I'm new to fail2ban and having a hard time figuring out the performance implications of the different configurations I'm thinking about setting up. This is running on a Raspberry Pi, so performance is a concern.
The obvious optimizations I can think of are using efficient regular expressions and only the minimum number of jails needed. I guess my specific questions are:
How does resource usage increase with respect to findtime values? I'm guessing very small and very large values could both impact the server in different ways regarding RAM vs. CPU.
Similarly, how does the size of a log file and the number of different log files monitored by fail2ban impact overall resource usage?
As an example, this jail would let someone try 3,600 SSH login passwords a day if they figured out the fail2ban config and adjusted their script timing to accommodate.
[ssh]
enabled = true
action = iptables-allports[name=ssh]
filter = sshd
logpath = /var/log/auth.log
maxretry = 6
findtime = 120
If we changed findtime to a different extreme of 86400 (1 day), it would only allow 5 attempts a day, but now it's monitoring a larger portion of the log file. How does this affect resource usage?
Another example, a jail for POST flood attacks:
[apache-post-flood]
enabled = true
action = iptables-allports[name=apache-post-flood]
filter = apache-post-flood
logpath = /var/log/apache2/*access.log
maxretry = 10
findtime = 10
Here, we have the opposite situation: the findtime window resets every 10 seconds. It's also monitoring all *access logs (I'm guessing; again, I'm new to this). That could mean it's monitoring access.log, other_vhosts_access.log, and perhaps an https_access.log for the HTTPS portions of the site. What if it's been a busy day and these files are all 10-20 MB each?
Hope this helps explain what's on my mind. Thanks in advance for your help.
There is only one way to find this out: test it. Nothing else.
Add monitoring of memory usage if needed, but there is no formula that will tell you how much CPU, I/O, or memory you will need.
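For example, a tiny sketch of such monitoring using psutil (an assumption; install it first), run from cron so you can see the trend over time:

import datetime

import psutil

# Log the resident memory of the fail2ban-server process.
for proc in psutil.process_iter(["name", "memory_info"]):
    if proc.info["name"] == "fail2ban-server":
        rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
        print(f"{datetime.datetime.now().isoformat()} rss={rss_mb:.1f} MB")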
As a general rule, when you retune your system, record the new value and the date of the retune in a comment. This will let you see whether there is any trend.
My personal take is to increase the affected resource by 30-50% each time. If you increase by less than that, you risk having to retune too often.
I have a "healthchecker" program, that calls a "prober" every 10 seconds to check if a service is running. If the prober exits with return code 0, the healthchecker considers the tested service fine. Otherwise, it considers it's not working.
I can't change the healthchecker (I can't make it check with a bigger interval, or using a better communication protocol than spawning a process and checking its exit code).
That said, I don't really want to probe the service every 10 seconds, because it's overkill. I just want to probe it every minute.
My solution is to make the prober keep a "cache" of the last answer, valid for 1 minute, and only really probe when this cache expires.
That seems fine, but I'm having trouble coming up with a decent approach, considering the program must exit (to return an exit code). My best bet so far would be to turn my prober into a daemon (which keeps the cache in memory) and create a client that just queries it and exits with its response, but that seems like too much work (dealing with threads, and so on).
Another approach would be to use SQLite/memcached/redis.
Any other ideas?
Since no one has really proposed anything, I'll drop my idea here. If you need an example, let me know and I'll include one.
The easiest thing to do would be to serialize a dictionary that contains the system health and the last time.time() it was checked. At the beginning of your program, unpickle the dictionary and check the time; if less than your 60-second interval has passed, exit with the cached result. Otherwise, check the health as normal and cache it (with the time).
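A minimal sketch of that idea (the cache path and the probe() placeholder are illustrative):

import pickle
import sys
import time

CACHE_PATH = "/tmp/prober_cache.pickle"  # illustrative location
TTL = 60  # seconds the cached result stays valid

def probe():
    # Placeholder for the real (expensive) service check; 0 means healthy.
    return 0

def main():
    try:
        with open(CACHE_PATH, "rb") as f:
            cache = pickle.load(f)
        if time.time() - cache["checked_at"] < TTL:
            sys.exit(cache["exit_code"])  # cache still fresh: reuse the answer
    except (OSError, pickle.PickleError, KeyError):
        pass  # no usable cache; fall through to a real probe

    code = probe()
    with open(CACHE_PATH, "wb") as f:
        pickle.dump({"checked_at": time.time(), "exit_code": code}, f)
    sys.exit(code)

if __name__ == "__main__":
    main()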