How to automatically delete collections in MongoDB after 3 days? - python

I'm relatively new to MongoDB. I've done stuff in it before, but my current project involves using collections to store values per "key". In this case, a key is a string of characters that will be used to access my software. The authentication and key generation will be done on my website using Flask as the backend, which means I can use Python to handle all the key generation and authentication. The code is mostly complete: it generates and authenticates keys well, and I'm really happy with how it works. The problem I face now is getting the collection or key to delete automatically after 3 days.
The reason I want them to delete after 3 days is that the keys aren't lifetime keys. The software is free, but in order to use it you must have a key. That key should expire after a certain amount of time (in this case, 3 days), and the user must go back and get another one.
Please note that I can't use individual documents: first, I've already set it up to use collections, and second, each key needs to store multiple documents rather than one.
I've already tried TTL on the collection but it doesn't seem to be working.
What's the best way to do this? Keep in mind the collection name is the key itself so it can't have a date of deletion in it (a date that another code scans and when that date is met the collection is deleted).
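One likely reason the TTL attempt appears not to work: a MongoDB TTL index expires documents, not whole collections. Below is a minimal sketch, assuming pymongo and a hypothetical `createdAt` field (not part of the original setup) that stamps every document in a key's collection with the key's creation time. The documents then expire together, and a separate cleanup step drops collections that have become empty. The function names are illustrative only.

```python
THREE_DAYS = 3 * 24 * 60 * 60  # TTL in seconds


def create_key_collection(db, key):
    """Create the per-key collection with a TTL index on 'createdAt' (assumed field)."""
    coll = db[key]
    # MongoDB's TTL monitor removes documents whose 'createdAt' is older than the TTL.
    coll.create_index("createdAt", expireAfterSeconds=THREE_DAYS)
    return coll


def add_value(coll, value, key_created_at):
    """Stamp each document with the key's creation time so all of them expire together."""
    coll.insert_one({"value": value, "createdAt": key_created_at})


def drop_empty_key_collections(db):
    """Drop collections left empty by the TTL monitor.

    Assumption: this database holds only key collections; otherwise filter the names first.
    """
    dropped = []
    for name in db.list_collection_names():
        if db[name].estimated_document_count() == 0:
            db.drop_collection(name)
            dropped.append(name)
    return dropped


# Hypothetical usage (requires a running MongoDB and pymongo):
# from datetime import datetime, timezone
# from pymongo import MongoClient
# db = MongoClient()["keys_db"]
# coll = create_key_collection(db, "some-generated-key")
# add_value(coll, "payload", datetime.now(timezone.utc))
```

The TTL monitor runs periodically, so documents can outlive the 3 days by a short while; checking `createdAt` during authentication would cover that gap.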

Related

Is it possible to generate hash from a queryset?

My idea is to create a hash of a queryset result. For example, product inventory.
Each update of this stock would generate a hash.
The intent is to request this queryset from the API only when there is a change (for example, a new product in inventory).
Example for this use:
no change, same hash - no request to get queryset
there was change, different hash. Then a request will be made.
This would be a feature designed for those who are consuming the data and not for the Django that is serving.
Does this make any sense? I saw that in python there is a way to generate a hash from a tuple, in my case it would be to use the frozenset and generate the hash. I don't know if it's a good idea.
I would comment, but I'm waiting on the 50 rep to be able to do that. It sounds like you're trying to cache results so you aren't querying on data that hasn't been changed. If you're not familiar with caching, the idea is to save hard-to-compute answers in memory for frequently queried endpoints/functions.
For example, if I had a program that calculated the first n digits of pi, I may choose to save a map of [digit count -> value] so that if 10 people asked me for the first thousand, I would only calculate it once. Redis is a popular option for caching, and I believe it exists for Django. It allows you to cache some information, set a time before expiration on it, and then wipe specific parts of that information (to force it to recalculate) every time something specific changes (like a new product in inventory).
Everybody should try writing their own cache at least once, like what you're describing, but the de facto professional option is to use a caching library. Your idea is good, it will definitely work, and you will probably want a dict of [hash->result], where result is the information you would send back over your API. If you plan to save data so it persists across multiple program starts, remember that Python by default randomizes the hash seed for strings and bytes on every run (see PYTHONHASHSEED), so the built-in hash() gives inconsistent values between runs. Check out this post for more info.
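A minimal sketch of the [hash->result] idea, assuming the queryset has already been turned into a list of plain dicts (for example via `.values()`); the `inventory` data and the `stable_hash` name are made up for illustration. It uses `hashlib` instead of the built-in `hash()` so the digest stays the same across program starts.

```python
import hashlib
import json


def stable_hash(rows):
    """Return a SHA-256 hex digest that does not depend on row order or the process."""
    # sort_keys gives each row a canonical form; sorting the rows ignores row order.
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


inventory = [
    {"sku": "A1", "name": "widget", "stock": 3},
    {"sku": "B2", "name": "gadget", "stock": 0},
]

cache = {}  # hash -> result that would be sent back over the API
current = stable_hash(inventory)
cache[current] = list(inventory)

# A new product arrives: the digest changes, so the client knows to re-request.
updated = inventory + [{"sku": "C3", "name": "gizmo", "stock": 7}]
changed = stable_hash(updated)
```

A client would keep the last digest it saw and fetch the full queryset only when the server reports a different one.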

Is there a way to set TTL on a document within AWS Elasticsearch utilizing python library?

I can't find any way to set up TTL on a document within AWS Elasticsearch using the Python elasticsearch library.
I looked at the code of the library itself and there is no argument for it, and I have yet to see any answers on Google.
There is none. You can use the index management policy if you like, which operates at the index level, not at the doc level. You have a bit of wiggle room, though, in that you can create a pattern data-* and have more than one index: data-expiring-2020-..., data-keep-me.
You can apply a template to the pattern data-expiring-* and set a transition to delete an index after, let's say, 20 days. If you roll over to a new index each day, you will see the oldest day's index deleted once it is more than 20 days old.
This method is preferable because deleting individual documents can consume large amounts of your cluster's capacity, whereas deleting an index removes entire shards at once. Other NoSQL databases such as DynamoDB operate in a similar fashion. Often what you can do is add another field to your docs, such as deletionDate, and add it to your query to filter out docs that are marked for deletion but are still alive in your index because a deletion job has not yet cleaned them up. That is how the TTL in DynamoDB behaves as well: data is not deleted the moment the TTL expires, but rather in batches to improve performance.
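The deletionDate filter described above could look roughly like this. It is only a sketch that builds the request bodies as plain dicts, assuming `deletionDate` is mapped as a date field and stored as an ISO 8601 string; the helper names are hypothetical and no client call is shown.

```python
from datetime import datetime, timedelta, timezone


def with_deletion_date(doc, ttl_days, now=None):
    """Return a copy of the document stamped with the time it should stop being visible."""
    now = now or datetime.now(timezone.utc)
    return {**doc, "deletionDate": (now + timedelta(days=ttl_days)).isoformat()}


def live_docs_query(user_query, now=None):
    """Wrap a query so that docs past their deletionDate are filtered out."""
    now = now or datetime.now(timezone.utc)
    return {
        "query": {
            "bool": {
                "must": [user_query],
                # Docs still present in the index but already expired are excluded here.
                "filter": [{"range": {"deletionDate": {"gt": now.isoformat()}}}],
            }
        }
    }


fixed_now = datetime(2020, 1, 1, tzinfo=timezone.utc)
doc = with_deletion_date({"title": "example"}, ttl_days=20, now=fixed_now)
body = live_docs_query({"match_all": {}}, now=fixed_now)
```

The resulting `body` would be passed as the search body to whichever client you use; the actual cleanup still happens at the index level.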

Reset Index in neo4j using Python

Is it possible to reset the indices once I have deleted the nodes, just as if I had deleted the whole folder manually?
I am deleting the whole database with node.delete() and relation.delete() and just want the indices to start at 1 again and not where I had actually stopped...
I assume you are referring to the node and relationship IDs rather than the indexes?
Quick answer: You cannot explicitly force the counter to reset.
Slightly longer answer: Generally speaking, these IDs should not carry any relevance within your application. There have been a number of discussions about this on the Neo4j mailing list and Stack Overflow, as the ID is an internal artifact and should not be used like a primary key. Its purpose is more akin to an in-memory address, and if you require unique identifiers, you are better off considering something like a UUID.
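For the UUID suggestion, a tiny sketch using only the standard library; the `uid` property name and the helper are assumptions for illustration, and how the properties get attached to a node depends on the Neo4j client in use.

```python
import uuid


def new_node_properties(name):
    """Build node properties carrying an application-level identifier."""
    # uuid4 is random, so it is independent of Neo4j's internal ID counter.
    return {"uid": str(uuid.uuid4()), "name": name}


first = new_node_properties("alice")
second = new_node_properties("bob")
```

Lookups would then go through `uid` (ideally indexed) rather than the internal node ID.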
You can stop your database, delete all the files in the database folder, and start it again.
This way, the ID generation will start back from 1.
This procedure completely wipes your data, so handle with care.
Now you certainly can do this using Python.
see https://stackoverflow.com/a/23310320

How do I expire keys in dynamoDB with Boto?

I'm trying to move from Redis to DynamoDB and so far everything is working great! The only thing I have yet to figure out is key expiration. Currently, I have my data set up with one primary key and no range key, like so:
{
"key" => string,
"value" => ["string", "string"],
"timestamp" => seconds since epoch
}
What I was thinking was to do a scan over the database for where timestamp is less than a particular value, and then explicitly delete them. This, however, seems extremely inefficient and would use up a ridiculous number of read/write units for no reason! On top of which, the expirations would only happen when I run the scan, so they could conceivably build up.
So, has anyone found a good solution to this problem?
I'm also using DynamoDB like the way we used to use Redis.
My suggestion is to write the key into different time-sliced tables.
For example, say a type of record should last a few minutes, or at most less than an hour. Then you can:
Create a new table every day for this type of record and store new records in today's table.
Use a read-repair approach when you read records: if you can't find a record in today's table, try to find it in yesterday's table and put it in today's table if necessary.
If you find the record in either table, verify it against its timestamp. It's not necessary to delete expired records at this moment.
Drop entire stale tables in your tasks.
This is easier to maintain and cost-efficient.
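The time-sliced scheme above can be sketched in memory, with one dict per day standing in for a DynamoDB table. This is only an illustration of the control flow; the class and method names are made up, and real code would create and drop actual tables.

```python
from datetime import date, timedelta


class TimeSlicedStore:
    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.tables = {}  # day -> {key: (value, timestamp)}, one "table" per day

    def put(self, key, value, now_ts, today):
        self.tables.setdefault(today, {})[key] = (value, now_ts)

    def get(self, key, now_ts, today):
        yesterday = today - timedelta(days=1)
        for day in (today, yesterday):
            record = self.tables.get(day, {}).get(key)
            if record is None:
                continue
            value, ts = record
            if now_ts - ts > self.ttl_seconds:
                return None  # expired; no delete needed, the table drop handles it
            if day != today:
                # Read repair: copy the live record into today's table.
                self.tables.setdefault(today, {})[key] = record
            return value
        return None

    def drop_stale(self, today):
        """Drop every table older than yesterday in one cheap operation each."""
        yesterday = today - timedelta(days=1)
        for day in [d for d in self.tables if d < yesterday]:
            del self.tables[day]


store = TimeSlicedStore(ttl_seconds=3600)
day1, day2, day3 = date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)
store.put("session", "abc", now_ts=1000, today=day1)
found = store.get("session", now_ts=1500, today=day2)  # found in yesterday's table
store.drop_stale(today=day3)                           # day1 table is dropped
```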
You could do lazy expiration and delete it on request.
For example:
store key "a" with an attribute "expiration", expires in 10 minutes.
fetch in 9 minutes, check expiration, return it.
fetch in 11 minutes. check expiration. since it's less than now, delete the entry.
This is what memcached was doing when I looked at the source a few years ago.
You'd still need to do a scan to remove all the old entries.
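A minimal in-memory sketch of the lazy expiration steps above; a dict stands in for the table and the clock is injectable so the behavior can be checked without waiting. The names are illustrative, not from any library.

```python
import time


class LazyExpiringStore:
    def __init__(self, clock=time.time):
        self._items = {}  # key -> (value, expiration timestamp)
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expiration = entry
        if expiration <= self._clock():
            # Expired: delete on request instead of scanning for old entries.
            del self._items[key]
            return None
        return value


now = [0]  # mutable fake clock, in seconds
lazy = LazyExpiringStore(clock=lambda: now[0])
lazy.set("a", "payload", ttl_seconds=600)  # expires in 10 minutes

now[0] = 9 * 60
at_nine_minutes = lazy.get("a")    # still valid

now[0] = 11 * 60
at_eleven_minutes = lazy.get("a")  # expired, entry is deleted
```

Entries that are never requested again stay in the store, which is why a periodic scan is still needed.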
You could also consider using Elasticache, which is meant for caching rather than a permanent data store.
It seems that Amazon has added expiration support to DynamoDB (as of February 27, 2017). Take a look at the official blog post:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
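A rough sketch of using that TTL feature from Python, assuming boto3 (the successor to Boto) and a low-level DynamoDB client passed in by the caller. The `expires_at` attribute name and both helper functions are assumptions for illustration; the TTL attribute has to be a Number holding seconds since the epoch.

```python
import time


def enable_ttl(client, table_name, attribute_name="expires_at"):
    """Turn on TTL for a table; 'client' is expected to be boto3.client('dynamodb')."""
    return client.update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": attribute_name},
    )


def item_with_ttl(key, values, ttl_seconds, now=None):
    """Build an item in the low-level attribute format with an expiry timestamp."""
    now = int(time.time()) if now is None else int(now)
    return {
        "key": {"S": key},
        "value": {"L": [{"S": v} for v in values]},
        # DynamoDB deletes the item some time after this epoch-seconds value has passed.
        "expires_at": {"N": str(now + ttl_seconds)},
    }


# Hypothetical usage (needs AWS credentials and an existing table named 'cache'):
# import boto3
# client = boto3.client("dynamodb")
# enable_ttl(client, "cache")
# client.put_item(TableName="cache", Item=item_with_ttl("k", ["a", "b"], 600))
```

Since deletion is not immediate, reads that must be exact can still compare `expires_at` against the current time.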
You could use the timestamp as the range key which would be indexed and allow for easier operations based on the time.

Can reading a list from a disk be better than loading a dictionary?

I am building an application where I am trying to allow users to submit a list of company and date pairs and find out whether or not there was a news event on that date. The news events are stored in a dictionary with a company identifier and a date as a key.
newsDict[('identifier','MM/DD/YYYY')]=[list of news events for that date]
The dictionary turned out to be much larger than I thought, too big even to build in memory, so I broke it down into three pieces, each limited to a particular range of company identifiers.
My plan was to take the user-submitted list and, using a dictionary, group the user's company identifiers by the newsDict in which each company's events would be expected to be found, and then load the newsDicts one after another to get the values.
Now I am wondering if it would not be better to keep the news events in a list, with each item of the list being a sublist made of a tuple and another list:
[('identifier','MM/DD/YYYY'),[list of news events for that date]]
my thought then is that I would have a dictionary that would have the range of the list for each company identifier
companyDict['identifier']=(begofRangeinListforComp,endofRangeinListforComp)
I would use the user input to look up the ranges I needed and construct a list of the identifiers and ranges sorted by the ranges. Then I would just read the appropriate section of the list to get the data and construct the output.
The biggest reason I see for this is that even with the dictionary broken into thirds each section takes about two minutes to load on my machine and the dictionary ends up taking about 600 to 750 mb of ram.
I was surprised to note that a list of eight million lines took only about 15 seconds to load and used about 1/3 of the memory of the dictionary that had 1/3 the entries.
Further, since I can discard the lines in the list as I work through the list I will be freeing memory as I work down the user list.
I am surprised, as I thought a dictionary would be the most efficient way to do this, but my poking at it suggests that the dictionary requires significantly more memory than a list. My reading of other posts on SO and elsewhere suggests that any other structure is going to require pointer allocations that are more expensive than list pointers. Am I missing something here, and is there a better way to do this?
After reading Alberto's answer and response to my comment I spent some time trying to figure out how to write the function if I were to use a db. Now I might be hobbled here because I don't know much about db programming but
I think the code to implement using a db would be much more complicated than:
outList = []
massiveFile = open('theFile', 'r')
# I get the user list and sort it by the key of the dictionary
for identifier in sortedUserList:
    identifierList = massiveFile[theDict[identifier]['beginPosit']:theDict[identifier]['endPosit'] + 1]
    for item in identifierList:
        if item.startswith(manipulation of the identifier):
            outList.append(item)
I still have to wrap this in a function, but I didn't see anything comparably simple if I converted the list to a db.
Of course, simplicity was not the reason that brought me to this forum. I still don't see that using another structure will cost less memory. I have 30000 company identifiers and approximately 3600 dates. Each item in my list is an object in the parlance of OOD. That is where I am struggling: I spent six hours this morning organizing the data for a dictionary before I gave up. Spending that amount of time to implement a database and then finding that I am using half a gig or more of someone else's memory to load it seems problematic.
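The range-lookup idea in the question can be sketched with byte offsets and `seek`, since a file object cannot be sliced like a list. This assumes one record per line in a hypothetical `identifier|date|events` layout, with all lines for a company stored contiguously; the function names and sample data are made up.

```python
import os
import tempfile


def build_file_and_index(path, records):
    """Write records (sorted by identifier) and return identifier -> (begin, end) byte offsets."""
    index = {}
    with open(path, "wb") as f:
        for identifier, date_str, events in records:
            begin = f.tell()
            f.write(f"{identifier}|{date_str}|{';'.join(events)}\n".encode("utf-8"))
            first_begin = index[identifier][0] if identifier in index else begin
            index[identifier] = (first_begin, f.tell())
    return index


def lookup(path, index, wanted):
    """Read only the byte ranges of the requested companies, in file order."""
    known = [pair for pair in wanted if pair[0] in index]
    known.sort(key=lambda pair: index[pair[0]][0])
    out = []
    with open(path, "rb") as f:
        for identifier, date_str in known:
            begin, end = index[identifier]
            f.seek(begin)
            block = f.read(end - begin).decode("utf-8")
            prefix = f"{identifier}|{date_str}|"
            out.extend(line for line in block.splitlines() if line.startswith(prefix))
    return out


records = [
    ("IBM", "05/21/2009", ["earnings", "new product"]),
    ("IBM", "05/22/2009", ["analyst call"]),
    ("MSFT", "05/21/2009", ["acquisition"]),
]
path = os.path.join(tempfile.mkdtemp(), "news.txt")
index = build_file_and_index(path, records)
results = lookup(path, index, [("MSFT", "05/21/2009"), ("IBM", "05/22/2009")])
```

Only the small index stays in memory; the news lines themselves are read on demand.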
With such a large amount of data, you should be using a database. This would be far better than looking at a list, and would be the most appropriate way of storing your data anyway. If you're using Python, SQLite support is built in through the standard sqlite3 module.
The dictionary will take more memory because it is effectively a hash table.
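For comparison with the list-based code in the question, here is a minimal sketch of the database route using the standard `sqlite3` module. The table layout, column names, and sample rows are assumptions; an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path here would keep the data on disk
conn.execute("CREATE TABLE news (identifier TEXT, news_date TEXT, event TEXT)")
# The index lets lookups by (identifier, date) avoid scanning the whole table.
conn.execute("CREATE INDEX idx_news ON news (identifier, news_date)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?, ?)",
    [
        ("IBM", "2009-05-21", "earnings call"),
        ("IBM", "2009-05-21", "new product"),
        ("MSFT", "2009-05-22", "acquisition"),
    ],
)
conn.commit()


def events_for(conn, pairs):
    """Return {(identifier, date): [events]} for each user-submitted pair."""
    out = {}
    for identifier, news_date in pairs:
        rows = conn.execute(
            "SELECT event FROM news WHERE identifier = ? AND news_date = ? ORDER BY rowid",
            (identifier, news_date),
        ).fetchall()
        out[(identifier, news_date)] = [row[0] for row in rows]
    return out


found = events_for(conn, [("IBM", "2009-05-21"), ("IBM", "2009-05-23")])
```

With a file-backed database, only the rows that match are pulled into memory, so the multi-minute dictionary load goes away.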
You don't have to go so far as using a database, since your lookup requirements are so simple. Just use the file system.
Create a directory structure based on the company name (or ticker), with subdirectories for each date. To find whether data exists and load it up, just form the name of the subdirectory where the data would be, and see if it exists.
E.g., IBM news for May 21 would be in C:\db\IBM\20090521\news.txt, if in fact there were news for that day. You just check if the file exists; no searches.
If you want to try and boost speed from there, come up with a scheme to cache a limited amount of results that are likely to be frequently requested (assuming you're operating a server). For that, you'd use a hash.
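The directory scheme and the optional cache could be sketched like this, assuming a `root/ticker/YYYYMMDD/news.txt` layout with one event per line. The helper names are hypothetical, and `functools.lru_cache` stands in for the limited hash-based cache.

```python
import os
import tempfile
from functools import lru_cache


def news_path(root, ticker, yyyymmdd):
    """Form the path where news for this company and date would live."""
    return os.path.join(root, ticker, yyyymmdd, "news.txt")


@lru_cache(maxsize=1024)  # keeps only a limited number of recent results in memory
def load_news(root, ticker, yyyymmdd):
    path = news_path(root, ticker, yyyymmdd)
    if not os.path.exists(path):
        return ()  # no file means no news that day; no searching involved
    with open(path, encoding="utf-8") as f:
        return tuple(line.rstrip("\n") for line in f)


# Build a tiny example tree in a temporary directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.dirname(news_path(root, "IBM", "20090521")))
with open(news_path(root, "IBM", "20090521"), "w", encoding="utf-8") as f:
    f.write("IBM announces results\n")

hit = load_news(root, "IBM", "20090521")
miss = load_news(root, "IBM", "20090522")
```

Note that the cache also remembers misses, so it would need clearing (`load_news.cache_clear()`) if news files are added while the server is running.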
