Are there any really good articles breaking down how to persist data into DynamoDB from Alexa? I can't seem to find a good article to break down step by step on how to persist a slot value into DynamoDB. I see in the Alexa docs here about implementing the code in Python, but that seems to be only part of what I'm looking for.
There's really no comprehensive breakdown of this, as like this tutorial, that persists data to S3. I would like to try to find something similar for DynamoDB. If there's an answer from a previous question that has answered it, let me know and I can mark it as a duplicate.
You can just use a tutorial which uses python and aws lambdas.
Like this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.html
the amazon article is more about the development kit which can give you some nice features to store persistent attributes for a users.
so usually I have a persistent store for users (game scores, ..., last use of skill whatever) and additional data in an other table
The persistence adapter has an interface spec that abstracts away most of the details operationally. You should be able to change persistence adapters by initializing one that meets the spec, and in the initialization there may be some different configuration options. But the way you put things in and get them out should remain functionally the same.
You can find the configuration options for S3 and Dynamo here. https://developer.amazon.com/en-US/docs/alexa/alexa-skills-kit-sdk-for-python/manage-attributes.html
I have written a "local persistence adapter" in JavaScript to let me store values in flat files at localhost instead of on S3 when I'm doing local dev/debug. Swapping the two out (depending on environment) is all handled at adapter initialization. My handlers that use the attributes manager don't change.
Related
I have got a requirement to extract data from Amazon Aurora RDS instance and load it to S3 to make it a data lake for analytics purposes. There are multiple schemas/databases in one instance and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in real-time capturing the DML operations periodically.
There may arise the question of using dedicated services like Data Migration or Copy activity provided by AWS. But I can't use them since the plan is to make the solution cloud platform independent as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I got to know it doesn't support JDBC as a source in Structured streaming. I read about multi-threading and multiprocessing techniques in Python for this but have to assess if they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background and they run continuously in defined cycles, say every 5 minutes). The data synchronization between RDS tables and S3 is also a crucial aspect to consider.
To talk more about the data in the source tables, they have an auto-increment ID field but are not sequential and might be missing a few numbers in between as a result of the removal of those rows due to the inactivity of the corresponding entity, say customers. It is not needed to pull the entire data of a record, only a few are pulled which would be been predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now, I'm quite confused to decide which approach to use and how to implement the solution once decided. Hence, I seek the help of people who dealt with or know of any solution to this problem statement. I'm happy to provide more info in case it is required to get to the right solution. Any help on this would be greatly appreciated.
For my app, I am using Flask, however the question I am asking is more general and can be applied to any Python web framework.
I am building a comparison website where I can update details about products in the database. I want to structure my app so that 99% of users who visit my website will never need to query the database, where information is instead retrieved from the cache (memcached or Redis).
I require my app to be realtime, so any update I make to the database must be instantly available to any visitor to the site. Therefore I do not want to cache views/routes/html.
I want to cache the entire database. However, because there are so many different variables when it comes to querying, I am not sure how to structure this. For example, if I were to cache every query and then later need to update a product in the database, I would basically need to flush the entire cache, which isn't ideal for a large web app.
I would prefer is to cache individual rows within the database. The problem is, how do I structure this so I can flush the cache appropriately when an update is made to the database? Also, how can I map all of this together from the cache?
I hope this makes sense.
I had this exact question myself, with a PHP project, though. My solution was to use ElasticSearch as an intermediate cache between the application and database.
The trick to this is the ORM. I designed it so that when Entity.save() is called it is first stored in the database, then the complete object (with all references) is pushed to ElasticSearch and only then the transaction is committed and the flow is returned back to the caller.
This way I maintained full functionality of a relational database (atomic changes, transactions, constraints, triggers, etc.) and still have all entities cached with all their references (parent and child relations) together with the ability to invalidate individual cached objects.
Hope this helps.
So a free eBook called "Redis in Action" by Josiah Carlson answered all of my questions. It it quite long, but after reading through, I have a fairly solid understanding of how to structure a caching architecture. It gives real world examples, such as a social network and a shopping site with tons of traffic. I will need to read through it again once or twice to fully understand. A great book!
Link: Redis in Action
Recently cloud dataflow python sdk was made available and I decided to use it. Unfortunately the support to read from cloud datastore is yet to come so I have to fall back on writing custom source so that I can utilize the benefits of dynamic splitting, progress estimation etc as promised. I did study the documentation thoroughly but am unable to put the pieces together so that I can speed up my entire process.
To be more clear my first approach was:
querying the cloud datastore
creating ParDo function and passing the returned query to it.
But with this it took 13 minutes to iterate over 200k entries.
So I decided to write custom source that would read the entities efficiently. But am unable to achieve that due to my lack of understanding of putting the pieces together. Can any one please help me with how to create custom source for reading from datastore.
Edited:
For first approach the link to my gist is:
https://gist.github.com/shriyanka/cbf30bbfbf277deed4bac0c526cf01f1
Thank you.
In the code you provided, the access to Datastore happens before the pipeline is even constructed:
query = client.query(kind='User').fetch()
This executes the whole query and reads all entities before the Beam SDK gets involved at all.
More precisely, fetch() returns a lazy iterable over the query results, and they get iterated over when you construct the pipeline, at beam.Create(query) - but, once again, this happens in your main program, before the pipeline starts. Most likely, this is what's taking 13 minutes, rather than the pipeline itself (but please feel free to provide a job ID so we can take a deeper look). You can verify this by making a small change to your code:
query = list(client.query(kind='User').fetch())
However, I think your intention was to both read and process the entities in parallel.
For Cloud Datastore in particular, the custom source API is not the best choice to do that. The reason is that the underlying Cloud Datastore API itself does not currently provide the properties necessary to implement the custom source "goodies" such as progress estimation and dynamic splitting, because its querying API is very generic (unlike, say, Cloud Bigtable, which always returns results ordered by key, so e.g. you can estimate progress by looking at the current key).
We are currently rewriting the Java Cloud Datastore connector to use a different approach, which uses a ParDo to split the query and a ParDo to read each of the sub-queries. Please see this pull request for details.
Setting up a data warehousing mining project on a Linux cloud server. The primary language is Python .
Would like to use this pattern for querying on data and storing data:
SQL Database - SQL database is used to query on data. However, the SQL database stores only fields that need to be searched on, it does NOT store the "blob" of data itself. Instead it stores a key that references that full "blob" of data in the a key-value Blobstore.
Blobstore - A key-value Blobstore is used to store actual "documents" or "blobs" of data.
The issue that we are having is that we would like more frequently accessed blobs of data to be automatically stored in RAM. We were planning to use Redis for this. However, we would like a solution that automatically tries to get the data out of RAM first, if it can't find it there, then it goes to the blobstore.
Is there a good library or ready-made solution for this that we can use without rolling our own? Also, any comments and criticisms about the proposed architecture would also be appreciated.
Thanks so much!
Rather than using Redis or Memcached for caching, plus a "blobstore" package to store things on disk, I would suggest to have a look at Couchbase Server which does exactly what you want (i.e. serving hot blobs from memory, but still storing them to disk).
In the company I work for, we commonly use the pattern you described (i.e. indexing in a relational database, plus blob storage) for our archiving servers (terabytes of data). It works well when the I/O done to write the blobs are kept sequential. The blobs are never rewritten, but simply appended at the end of a file (it is fine for an archiving application).
The same approach has been also used by others. For instance:
Bitcask (used in Riak): http://downloads.basho.com/papers/bitcask-intro.pdf
Eblob (used in Elliptics project): http://doc.ioremap.net/eblob:eblob
Any SQL database will work for the first part. The Blobstore could also be obtained, essentially, "off the shelf" by using cbfs. This is a new project, built on top of couchbase 2.0, but it seems to be in pretty active development.
CouchBase already tries to serve results out of RAM cache before checking disk, and is fully distributed to support large data sets.
CBFS puts a filesystem on top of that, and already there is a FUSE module written for it.
Since fileststems are effectively the lowest-common-denominator, it should be really easy for you to access it from python, and would reduce the amount of custom code you need to write.
Blog post:
http://dustin.github.com/2012/09/27/cbfs.html
Project Repository:
https://github.com/couchbaselabs/cbfs
I am running a webapp on google appengine with python and my app lets users post topics and respond to them and the website is basically a collection of these posts categorized onto different pages.
Now I only have around 200 posts and 30 visitors a day right now but that is already taking up nearly 20% of my reads and 10% of my writes with the datastore. I am wondering if it is more efficient to use the google app engine's built in get_by_id() function to retrieve posts by their IDs or if it is better to build my own. For some of the queries I will simply have to use GQL or the built in query language because they are retrieved on more than just and ID but I wanted to see which was better.
Thanks!
Are you doing efficient caching? (or any caching at all).
Also, if you're using that many writes for 300 posts, seems like you might have a problem with your models. Have you looked at the Datastore viewer to seem how many writes you use per entity?
You might read the docs on Exploding indexes, maybe that's part of your problem?
It's way better to use get_by_id(). It finds the exact object, and costs way less (counts as a query with only one entity).
I'd suggest using pre-existing code and building around that in stead of re-inventing the wheel.