I'm fairly new to NoSQL. I'm using Python/Boto, but this is a fairly general question. I'm currently trying to switch a project from MongoDB to DynamoDB and am seeking some advice on DynamoDB and its capacity to query whether a list contains a certain string. I have been searching for the past day or so, but I'm starting to worry that it doesn't have this facility, other than using scan, which is terribly slow considering the db will be queried thousands of times on updates. There is a similar unanswered question here.
I understand primary keys can only be N, S or B and not something like String Set (SS) which would have been useful.
The data is fairly simple and would look something like this. I'm looking for the most efficient way to query the db, based on the tags attribute, for entries that include 'string1' OR 'string2'. Again, I don't want to use scan, but I am willing to consider normalizing the data structure if there is a best practice in DynamoDB.
{
id: <some number used as a primary key>,
tags: ['string1', 'string2'...],
data: {some JSON object}
}
From what I've read, this doesn't seem possible even using global secondary indexes, which is strange, since that would make DynamoDB only useful for the simplest queries. I'm hoping I'm missing something.
In MongoDB, you have multikey indices, but not in DynamoDB.
I'd think you'd need to solve it like you would in a relational database: create a many-to-many relation table with tag as your hash key and entry id as your sort key. And find some way to keep your relation table in sync with your entry table.
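For illustration, a minimal boto3 sketch of that relation-table idea (the names entry_tags, tag, entry_id and entries are all made up here, and result pagination is ignored):
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
entry_tags = dynamodb.Table('entry_tags')   # hypothetical relation table: tag (hash key), entry_id (range key)
entries = dynamodb.Table('entries')         # hypothetical main table keyed by id

def entry_ids_for_tags(tags):
    # Collect the ids of every entry carrying any of the given tags (OR semantics).
    ids = set()
    for tag in tags:
        response = entry_tags.query(KeyConditionExpression=Key('tag').eq(tag))
        ids.update(item['entry_id'] for item in response['Items'])
    return ids

# Fetch the matching entries by primary key.
for entry_id in entry_ids_for_tags(['string1', 'string2']):
    item = entries.get_item(Key={'id': entry_id}).get('Item')
The catch, as noted above, is keeping entry_tags in sync whenever an entry's tags change.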
I am new to DynamoDB and want to compare the values of a Python list with an attribute value in a DynamoDB table.
I am able to compare a single value by using a query with an index key:
from boto3.dynamodb.conditions import Key

response = dynamotable.query(
    IndexName='Classicmovies',
    KeyConditionExpression=Key('DDT').eq('BBB-rrr-jjj-mq'))
but I want to compare an entire list, which would have to go into the .eq, as follows:
movies =['ddd-dddss-gdgdg','kkdf-dfdfd-www','dfw-gddf-gssg']
I have searched a lot and am not able to figure out the right way to do this.
It's hard to say what you are trying to do. A query will only retrieve a bunch of records belonging to a single item collection. Maybe what you need is a scan, but please avoid heavy use of scans unless it's for maintenance purposes.
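If the goal is simply "fetch the items whose DDT matches any value in that list", one workable pattern is to issue one query per value and merge the results; a rough sketch reusing the table and index names from the question:
from boto3.dynamodb.conditions import Key

movies = ['ddd-dddss-gdgdg', 'kkdf-dfdfd-www', 'dfw-gddf-gssg']

items = []
for movie_id in movies:
    # One Query per key value: KeyConditionExpression only supports equality
    # on the partition key, so there is no IN-style condition to pass to .eq().
    response = dynamotable.query(
        IndexName='Classicmovies',
        KeyConditionExpression=Key('DDT').eq(movie_id))
    items.extend(response['Items'])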
I have a script that repopulates a large database and would generate id values from other tables when needed.
An example would be recording order information when given customer names only. I would check to see if the customer exists in a CUSTOMER table. If so, I would run a SELECT query to get his ID and insert the new record. Otherwise, I would create a new CUSTOMER entry and get the Last_Insert_Id().
Since these values duplicate a lot and I don't always need to generate a new ID, would it be better for me to store the ID => CUSTOMER relationship as a dictionary that gets checked before reaching the database, or should I make the script constantly requery the database? I'm thinking the first approach is best since it reduces load on the database, but I'm concerned about how large the ID dictionary would get and the impact of that.
The script is running on the same box as the database, so network delays are negligible.
"Is it more efficient"?
Well, a dictionary is storing the values in a hash table. This should be quite efficient for looking up a value.
The major downside is maintaining the dictionary. If you know the database is not going to be updated, then you can load it once and the in-application memory operations are probably going to be faster than anything you can do with a database.
However, if the data is changing, then you have a real challenge. How do you keep the memory version aligned with the database version? This can be very tricky.
My advice would be to keep the work in the database, using indexes for the dictionary key. This should be fast enough for your application. If you need to eke out further speed, then using a dictionary is one possibility -- but no doubt, one possibility out of many -- for improving the application performance.
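As a rough illustration of the in-memory approach being discussed, a cached lookup with a database fallback might look like this (the table and column names are placeholders, and it assumes a MySQL driver such as MySQLdb or PyMySQL, where cursor.lastrowid carries the LAST_INSERT_ID() value):
customer_ids = {}   # in-memory cache: customer name -> id

def get_customer_id(cursor, name):
    # Return the id for a customer, inserting a new row only when necessary.
    if name in customer_ids:
        return customer_ids[name]          # cache hit: no round trip to the database

    cursor.execute("SELECT id FROM customer WHERE name = %s", (name,))
    row = cursor.fetchone()
    if row is not None:
        customer_id = row[0]
    else:
        cursor.execute("INSERT INTO customer (name) VALUES (%s)", (name,))
        customer_id = cursor.lastrowid     # same value as LAST_INSERT_ID()

    customer_ids[name] = customer_id
    return customer_id
The caveat from above still applies: if anything else writes to the table, the cache can drift out of sync with the database.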
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have produced unordered results, all of them less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting tables in the target databases, then went through and entered all the primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
ws.cell(row=0,column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order I want things in as an array, and use that to loop through and populate the data. The pro of this is very strict/granular control; the con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weight for every field that might be encountered, then write a method that sorts _meta.get_field_names() by weight. The con of this is that the columns may not end up 100% in the right order, such as Name coming before DOB in one DB but after it in another.
Feel free to tell me I'm doing it all wrong or to suggest completely different ways of doing this; I'm all ears. I'm very much new to Python and Peewee (and ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle, but its Excel libraries would cause me just as many problems, and I'd like to take this as a chance to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
print field.name
This will print the field names in the order they are declared on the model.
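Combining that with the openpyxl snippet from the question could look roughly like this (a sketch only: it assumes get_sorted_fields() yields field objects as described, reuses the question's ws worksheet, and keeps the question's cell indexing, though newer openpyxl releases use 1-based indices):
# Write headers in declared order, then the data rows.
fields = list(User._meta.get_sorted_fields())

for i, field in enumerate(fields):
    ws.cell(row=0, column=i).value = field.name

for r, user in enumerate(User.select(), start=1):
    for c, field in enumerate(fields):
        ws.cell(row=r, column=c).value = getattr(user, field.name)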
First off, this is my first project using SQLAlchemy, so I'm still fairly new.
I am making a system to work with GTFS data. I have a back end that seems to be able to query the data quite efficiently.
What I am trying to do, though, is allow the GTFS files to update the database with new data. The problem I am hitting is pretty obvious: if the data I'm trying to insert is already in the database, we have a conflict on the uniqueness of the primary keys.
For efficiency reasons, I decided to use the following code for insertions, where model is the model object I would like to insert the data into, and data is a precomputed, cleaned list of dictionaries to insert.
for chunk in [data[i:i+chunk_size] for i in xrange(0, len(data), chunk_size)]:
    engine.execute(model.__table__.insert(), chunk)
There are two solutions that come to mind.
I find a way to do the insert such that, if there is a collision, we don't care and don't fail. I believe the code above is using the TableClause, so I checked there first, hoping to find a suitable replacement or flag, with no luck.
Before cleaning the data, we get the list of primary key values, and if a given element matches on the primary keys, we skip cleaning and inserting that value. I found that I was able to get the PrimaryKeyConstraint from Table.primary_key, but I can't seem to get the columns out of it, or find a way to query for only specific columns (in my case, the primary keys).
Either should be sufficient, if I can find a way to do it.
After looking into both of these for the last few hours, I can't seem to find either. I was hoping that someone might have done this previously, and point me in the right direction.
Thanks in advance for your help!
Update 1: There is a third option I failed to mention above: purging all the data from the database and reinserting it. I would prefer not to do this, as even with small GTFS files there are easily hundreds of thousands of elements to insert, and this seems to take about half an hour to perform, which would mean a lot of downtime for updates if this makes it to production.
With SQLAlchemy, you simply create a new instance of the model class, and merge it into the current session. SQLAlchemy will detect if it already knows about this object (from cache or the database) and will add a new row to the database if needed.
for row in chunk:
    newentry = model(**row)   # build an instance from one row dictionary
    session.merge(newentry)   # SQLAlchemy decides whether this becomes an INSERT or an UPDATE
session.commit()
Also see this question for context: Fastest way to insert object if it doesn't exist with SQLAlchemy
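As for the second option in the question: the key columns are reachable through the constraint's .columns collection, so already-present rows can be filtered out before the bulk insert. A rough sketch (1.x-era SQLAlchemy Core, assuming data, model and chunk_size as in the question, and that the set of existing keys fits comfortably in memory):
from sqlalchemy import select

# The primary key columns, pulled out of the PrimaryKeyConstraint.
pk_cols = list(model.__table__.primary_key.columns)

# Key tuples already present in the database.
existing = set(tuple(row) for row in engine.execute(select(pk_cols)))

# Drop any incoming row whose key tuple is already there, then bulk insert as before.
new_rows = [row for row in data
            if tuple(row[col.name] for col in pk_cols) not in existing]

for chunk in [new_rows[i:i + chunk_size] for i in xrange(0, len(new_rows), chunk_size)]:
    engine.execute(model.__table__.insert(), chunk)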
I have a property management application consisting of these tables:
tenants
landlords
units
properties
vendors-contacts
Basically, I want one search field to search them all rather than having to select which category I am searching. Would this be an acceptable solution, technology-wise?
Will searching across 5 tables be OK in the long run and not bog down the server? What's the best way of accomplishing this?
Using PostgreSQL
Why not create a view that is a union of the tables, aggregating the columns you want to search on into one, and then search on that aggregated column?
You could do something like this:
select 'tenants:' || t.id::text, <shared fields> from tenants as t union
select 'landlords:' || l.id::text, <shared fields> from landlords as l union
...
This requires some logic to be embedded from the client querying; it has to know how to fabricate the key that it's looking for in order to search on a single field.
That said, it's probably better if you just have a separate column which contains a "type" value (e.g. landlord, tenant) and then filter on both the type and the ID, as it will be computationally less expensive (and can be optimized better).
You want to use the built-in full text search or a separate product like Lucene. This is optimised for unstructured searches over heterogeneous data.
Also, don't forget that normal indices cannot be used for something LIKE '%...%'. Using a full text search engine will also be able to do efficient substring searches.
I would suggest using a specialized full-text indexing tool like Lucene for this. It will probably be easier to get up and running, and the result is faster and more featureful too. Postgres full text indexes will be useful if you also need structured search capability on top of this or transactionality of your search index is important.
If you do want to implement this in the database, something like the following scheme might work, assuming you use surrogate keys:
for each searchable table create a view that has the primary key column of that table, the name of the table and a concatenation of all the searchable fields in that table.
create a functional GIN or GiST index on the underlying table over the to_tsvector() of the exact same concatenation.
create a UNION ALL over all the views to create the searchable view.
After that you can do the searches like this:
SELECT id, table_name, ts_rank_cd(body, query) AS rank
FROM search_view, to_tsquery('search&words') query
WHERE query @@ body
ORDER BY rank DESC
LIMIT 10;
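For completeness, running that search from Python with psycopg2 might look roughly like this (the connection string is a placeholder; search_view and the column names follow the query above):
import psycopg2

conn = psycopg2.connect("dbname=propertydb")   # placeholder connection string
cur = conn.cursor()
cur.execute("""
    SELECT id, table_name, ts_rank_cd(body, query) AS rank
    FROM search_view, to_tsquery(%s) AS query
    WHERE query @@ body
    ORDER BY rank DESC
    LIMIT 10;
""", ('search & words',))
for row_id, table_name, rank in cur.fetchall():
    print row_id, table_name, rank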
You should be fine, and there's really no other good (easy) way to do this. Just make sure the fields you are searching on are properly indexed though.