I'm trying to move from Redis to DynamoDB, and so far everything is working great! The only thing I have yet to figure out is key expiration. Currently, I have my data set up with one primary key and no range key, like so:
{
"key" => string,
"value" => ["string", "string"],
"timestamp" => seconds since epoch
}
What I was thinking was to do a scan over the table for items whose timestamp is less than a particular value, and then explicitly delete them. This, however, seems extremely inefficient and would use up a ridiculous number of read/write units for no reason! On top of that, the expirations would only happen when I run the scan, so expired items could conceivably build up.
So, has anyone found a good solution to this problem?
I'm also using DynamoDB the way we used to use Redis.
My suggestion is to write the keys into different time-sliced tables.
For example, say a type of record should last a few minutes, or at most less than an hour. Then you can:
Create a new table every day for this type of record and store new records in today's table.
Use a read-repair trick when you read records: if you can't find a record in today's table, try to find it in yesterday's table and copy it into today's table if necessary (a rough sketch follows below).
If you find the record in either table, verify it against its timestamp. It's not necessary to delete expired records at this point.
Drop entire stale tables in a scheduled background task.
This is easier to maintain and cost-efficient.
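Here is a minimal sketch of that read-repair lookup with boto3; the daily table naming scheme, key names, and max_age_seconds are my own assumptions, not something from the question:

# Sketch only: assumes daily tables named records-YYYY-MM-DD with a hash key "key"
# and a numeric "timestamp" attribute (seconds since epoch), as in the question.
import time
from datetime import datetime, timedelta

import boto3

dynamodb = boto3.resource("dynamodb")

def table_for(day):
    return dynamodb.Table("records-" + day.strftime("%Y-%m-%d"))

def get_with_read_repair(key, max_age_seconds):
    today = datetime.utcnow()

    # Try today's table first.
    item = table_for(today).get_item(Key={"key": key}).get("Item")
    if item is None:
        # Fall back to yesterday's table and repair into today's table if found.
        item = table_for(today - timedelta(days=1)).get_item(Key={"key": key}).get("Item")
        if item is not None:
            table_for(today).put_item(Item=item)

    # Treat the record as expired if its timestamp is too old; no delete is needed,
    # since the whole stale table gets dropped later anyway.
    if item is not None and time.time() - float(item["timestamp"]) > max_age_seconds:
        return None
    return item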
You could do lazy expiration and delete it on request.
For example:
store key "a" with an attribute "expiration", expires in 10 minutes.
fetch in 9 minutes, check expiration, return it.
fetch in 11 minutes. check expiration. since it's less than now, delete the entry.
This is what memcached was doing when I looked at the source a few years ago.
You'd still need to do a scan to remove all the old entries.
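As a rough illustration of that lazy check with boto3 (the table name "cache" and the "expiration" attribute are just examples):

# Sketch only: assumes a table "cache" with hash key "key" and a numeric
# "expiration" attribute holding an epoch timestamp in seconds.
import time

import boto3

table = boto3.resource("dynamodb").Table("cache")

def get_unless_expired(key):
    item = table.get_item(Key={"key": key}).get("Item")
    if item is None:
        return None
    if float(item["expiration"]) < time.time():
        # Expired: delete lazily on read and behave as if it was never there.
        table.delete_item(Key={"key": key})
        return None
    return item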
You could also consider using Amazon ElastiCache, which is meant for caching rather than as a permanent data store.
It seems that Amazon just added expiration support to DynamoDB (as of February 27, 2017). Take a look at the official blog post:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
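Enabling it is a one-off call per table; with boto3 it looks roughly like this (the table and attribute names are examples, and the attribute must hold an epoch timestamp in seconds):

import boto3

client = boto3.client("dynamodb")

# Tell DynamoDB which numeric attribute holds the expiry time (epoch seconds).
client.update_time_to_live(
    TableName="my-table",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
)

Note that expired items are removed in the background, typically within a day or two of expiring, so reads should still filter on the TTL attribute if exact expiry matters.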
You could use the timestamp as the range key which would be indexed and allow for easier operations based on the time.
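For instance, with key as the hash key and timestamp as the range key, you could query (rather than scan) for a single key's expired items and delete them. A hedged boto3 sketch, with the table name made up:

import time

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")

def delete_expired(key, cutoff=None):
    cutoff = cutoff or int(time.time())
    # The query uses the indexed range key, so only matching items are read.
    resp = table.query(
        KeyConditionExpression=Key("key").eq(key) & Key("timestamp").lt(cutoff)
    )
    with table.batch_writer() as batch:
        for item in resp["Items"]:
            batch.delete_item(Key={"key": key, "timestamp": item["timestamp"]})

(For brevity this ignores query pagination.)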
I'm relatively new to MongoDB. I've done stuff in it before, but my current project involves using collections to store values per "key". In this case, a key refers to a string of characters that will be used to access my software. The authentication and key generation are done on my website using Flask as the backend, which means I can use Python to handle all the key generation and authentication. The code is complete for the most part; it's able to generate and authenticate keys amazingly well, and I'm really happy with how it works. The problem I now face is getting the collection (or key) to automatically delete after 3 days.
The reason I want them deleted after 3 days is that the keys aren't lifetime keys. The software is free, but in order to use it you must have a key. That key should expire after a certain amount of time (in this case, 3 days), and the user must go back and get another one.
Please note that I can't use individual documents: one, I've already set it up to use collections, and two, it needs to store multiple documents as compared to one document.
I've already tried TTL on the collection but it doesn't seem to be working.
What's the best way to do this? Keep in mind that the collection name is the key itself, so it can't have a deletion date embedded in it (a date that some other code scans for and, once that date is reached, deletes the collection).
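For reference, MongoDB's built-in TTL only works at the document level, via a TTL index on a date field; it cannot drop a whole collection by itself. A minimal pymongo sketch of how such an index is normally set up (the database name, collection name, and field names here are made up):

import datetime

from pymongo import MongoClient

db = MongoClient()["licenses"]          # database name is an example
col = db["some_key"]                    # one collection per key, as in the question

# Documents expire roughly 3 days (259200 s) after their "created_at" value.
col.create_index("created_at", expireAfterSeconds=259200)
col.insert_one({"created_at": datetime.datetime.utcnow(), "payload": "..."})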
I am trying to get the number of items in a DynamoDB table.
Code
def urlfn():
    if request.method == 'GET':
        print("GET request processing")
        return render_template('index.html', count=table.item_count)
But I am not getting the real count. I found that there is a 6-hour delay before item_count is updated. Is there any way to get the real count of items in a table?
Assuming that table in your code above is a boto3 service resource that is already defined, you can use:
table.scan(Select='COUNT')['Count']
This will give you an up-to-date count of the items in your table. BUT it reads every single item, and each call only scans up to 1 MB of data, so you have to paginate for larger tables; for significantly large tables this can take a long time. AND it uses read capacity on your table to do so. So, for most practical purposes it really isn't a very good way to do it.
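A paginated version of that count might look roughly like this (a sketch only; the table name is an example):

import boto3

table = boto3.resource("dynamodb").Table("my-table")

def live_item_count():
    # Select='COUNT' avoids transferring item data, but still consumes read capacity.
    total = 0
    kwargs = {"Select": "COUNT"}
    while True:
        resp = table.scan(**kwargs)
        total += resp["Count"]
        if "LastEvaluatedKey" not in resp:
            return total
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]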
Depending on your use case here there are a few other options:
add a meta item that is updated every time a new document is added to the table. This is just a document with whatever hash key / sort key combination you want, with a "value" attribute that you add 1 to every time you add a new item to the database (see the sketch after this list).
forget about using Dynamo for this. Sorry if that sounds harsh, but DynamoDB is a NoSQL database, and attempting to use it in the same manner as a traditional relational database system is folly. The number of 'rows' is not something Dynamo is designed to report, because that's not its use case. There are no rows in Dynamo: there are documents, those documents are partitioned, and you access small chunks of them at a time, meaning the back-end architecture does not lend itself to knowing what the entire system contains at any given moment (hence the 6-hour delay).
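A minimal sketch of that counter item, using an atomic ADD in update_item (the table name, meta key, and attribute name are just examples):

import boto3

table = boto3.resource("dynamodb").Table("my-table")

def increment_item_count(by=1):
    # ADD is atomic, so concurrent writers don't lose updates.
    table.update_item(
        Key={"key": "__meta__"},                    # a single well-known meta item
        UpdateExpression="ADD #v :inc",
        ExpressionAttributeNames={"#v": "value"},   # "value" is a reserved word
        ExpressionAttributeValues={":inc": by},
    )

def read_item_count():
    item = table.get_item(Key={"key": "__meta__"}).get("Item", {})
    return int(item.get("value", 0))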
I can't find any way to set up TTL on a document within AWS Elasticsearch using the Python elasticsearch library.
I looked at the code of the library itself; there is no argument for it, and I have yet to see any answers on Google.
There is none. You can use an index management policy if you like, which operates at the index level, not at the document level. You have a bit of wiggle room, though, in that you can create a pattern like data-* and have more than one index, e.g. data-expiring-2020-..., data-keep-me.
You can apply a template to the pattern data-expiring-* and set a transition to delete an index after, let's say, 20 days. If you roll over to a new index each day, you will see the oldest index being deleted at the end of each day once it is over 20 days old.
This method is much preferable, because deleting individual documents can consume large amounts of your cluster's capacity, whereas dropping entire indices is cheap. Other NoSQL databases such as DynamoDB operate in a similar fashion: often what you can do is add another field to your docs, such as deletionDate, and add that to your query to filter out docs which are marked for deletion but are still alive in your index because a deletion job has not yet cleaned them up. That is how TTL in DynamoDB behaves as well; data is not deleted the moment the TTL expires, but rather in batches to improve performance.
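If you prefer not to rely on a management policy, the same rollover idea can be scripted with the Python client; a rough sketch, where the index naming and 20-day retention are assumptions:

from datetime import datetime, timedelta

from elasticsearch import Elasticsearch

es = Elasticsearch()  # point this at your AWS Elasticsearch endpoint

def index_doc(doc):
    # One index per day, matching the data-expiring-* pattern.
    es.index(index="data-expiring-" + datetime.utcnow().strftime("%Y-%m-%d"), body=doc)

def drop_old_indices(days_to_keep=20):
    cutoff = datetime.utcnow() - timedelta(days=days_to_keep)
    for name in es.indices.get(index="data-expiring-*"):
        day = datetime.strptime(name[len("data-expiring-"):], "%Y-%m-%d")
        if day < cutoff:
            # Dropping a whole index is far cheaper than deleting documents one by one.
            es.indices.delete(index=name)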
I have a number of python processes that each repetitively query a separate betting API. The requests come in bursts of ~20-100 all at once, then the process goes away to parse the responses and repeats a second later or so. I am hoping to use Cassandra as my raw storage for requests and responses. This will allow me to debug problems with the parsed data and/or re-parse later. I am trying to devise a schema for this.
I am thinking I can have a separate table (column family) per API, not much to say about that. My initial idea for the table schema was:
stripe text, // free text to describe the flavour of the request, e.g. live games
date int, // YYYYMMDD
requests map<timestamp, text>,
responses map<timestamp, text>
I could then append the requests and responses to the correct row as they happened and end up with a row per day of timewise sorted requests and responses. I could then easily go back and find data for a given day (which seems like a reasonable chunk to process at a time), then go to a specific point in time on the day if required.
The problem here is obvious: two requests made at exactly the same time, given my timestamping resolution, will end up overwriting one another. As unlikely as that might be, it is still wrong.
I then went on to a second idea I didn't really like: disambiguate the key using the timestamp plus a hash of the request, assuming that the same request at the same time should return the same result and therefore be unique enough, i.e. str(timestamp) + str(hash(request)), meaning the schema becomes (timestamp becomes text):
stripe text, // free text to describe the flavour of the request, e.g. live games
date int, // YYYYMMDD
requests map<text, text>,
responses map<text, text>
This sucks because text takes more space and is slower to compare, but I was willing to accept it. Then I hit this problem:
E InvalidRequest: code=2200 [Invalid query] message="Map value is too long. Map values are limited to 65535 bytes but 435145 bytes value provided"
This is basically telling me I can't ever put these things in a collection column anyway, as responses are of arbitrary size and almost always bigger than the limit.
I am new to the Cassandra world, but I thought these CQL maps end up corresponding to separate column names and values in the record, and that each column has a size limit of 2 GB. One thing I can think of is to not use a map and instead keep altering the table schema every time, then insert a normal value into the cell, but I am not sure how that differs in the underlying store.
So I guess I have 2 questions:
Is this just a limitation of CQL or all of Cassandra?
Can someone more experienced think of an overall better approach?
Thanks for reading
KCH
To answer my own question: my misunderstanding lies in the fact that in CQL, the first part of the primary key is always the partition (row) key, and for a composite key the remaining parts form the clustering (column) key. Maps also end up in separate columns of the same row, using their own key-name "unrolling" convention, but with the collection value size limit applied.
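Given that, the usual fix is to drop the maps entirely and use clustering columns, so each request/response pair becomes its own row (ordinary cells, not 64 KB-limited collection values) inside the same partition. A hedged sketch with the Python driver; the keyspace, table, and column names are my own:

import hashlib

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("betting")  # keyspace name is an example

session.execute("""
    CREATE TABLE IF NOT EXISTS api_calls (
        stripe text,
        date int,              // YYYYMMDD
        ts timestamp,
        request_hash text,
        request text,
        response text,
        PRIMARY KEY ((stripe, date), ts, request_hash)
    )
""")

def record_call(stripe, date, ts, request, response):
    # The hash disambiguates two identical timestamps, as in the question.
    request_hash = hashlib.sha1(request.encode()).hexdigest()
    session.execute(
        "INSERT INTO api_calls (stripe, date, ts, request_hash, request, response) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (stripe, date, ts, request_hash, request, response),
    )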
I use python, ndb and the datastore. My model ("Event") has a property:
created = ndb.DateTimeProperty(auto_now_add=True).
Events get saved every now and then, sometimes several within one second.
I want to "poll for new events", without getting the same Event twice, and get an empty result if there aren't any new Events. However, polling again might give me new events.
I have seen Cursors, but I don't know if they can somehow be used to poll for new Events after having reached the end of the first query. The "next_cursor" is None when I've reached the (current) end of the data.
Keeping the last received "created" DateTime property and using that to get the next batch works, but that only has a resolution of seconds, so the ordering might get screwed up.
Must I create my own transactional, incrementing counter in Event for this?
Yes, using cursors is a valid option. Even though this link is from the Java documentation, it's valid for Python also. The second paragraph is what you are looking for:
An interesting application of cursors is to monitor entities for unseen changes. If the app sets a timestamp property with the current date and time every time an entity changes, the app can use a query sorted by the timestamp property, ascending, with a Datastore cursor to check when entities are moved to the end of the result list. If an entity's timestamp is updated, the query with the cursor returns the updated entity. If no entities were updated since the last time the query was performed, no results are returned, and the cursor does not move.
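A rough sketch of that polling loop with ndb, assuming the Event model from the question (the page size and function name are my own):

from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import ndb

class Event(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)

def poll_new_events(cursor_urlsafe=None, batch_size=20):
    cursor = Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    query = Event.query().order(Event.created)
    events, next_cursor, more = query.fetch_page(batch_size, start_cursor=cursor)
    # Persist next_cursor.urlsafe() and pass it back in on the next poll;
    # if nothing new was written, events will simply be empty.
    next_urlsafe = next_cursor.urlsafe() if next_cursor else cursor_urlsafe
    return events, next_urlsafe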
EDIT: Prospective search was shut down on December 1, 2015.
Rather than polling, an alternative approach would be to use prospective search:
https://cloud.google.com/appengine/docs/python/prospectivesearch/
From the docs:
"Prospective search is a querying service that allows your application
to match search queries against real-time data streams. For every
document presented, prospective search returns the ID of every
registered query that matches the document."