I have a relatively simple question, but I am very curious what the convention is and what the reason for it is. The database is PostgreSQL and the programming language I am using is Python, but this is not at the core of my question.
Suppose we have the following JSON data structure, which we still need to parse.
{
    "harry": {
        "transactions": {
            "desc": ["fish", "drinks", "potatoes"],
            "amount": [32, 12, 35]
        },
        "country": "UK"
    },
    "james": {
        "transactions": {
            "desc": ["computer", "water", "table", "phone"],
            "amount": [100, 32, 59, 99]
        },
        "country": "China"
    }
}
and we would like to put this in a PostgreSQL database. I am inclined to create UUIDs for the persons "harry" and "james", and also UUIDs for their transactions, and then insert them into the database. This results in three tables: personal_info, trans and pi_trans (which links the other two).
However, one could also argue that I should let PostgreSQL generate an identifier (one that increases by 1 with every insert) and then populate the pi_trans table with the identifiers I retrieve from PostgreSQL.
I suspect the latter approach may be too slow, but other than that I do not see a clear reason not to use the second approach.
In addition, the second approach guarantees unique identifiers when we insert new records, while the former approach -can- have ID collisions (although I suspect that with UUIDs the chance is really small).
Could someone help me figure out which approach to use and when?
Both identity columns (sequence generated values) and UUIDs will work for you. The collision probability for UUIDs is so small that you can safely bet on it (it is more likely that cosmic rays hit your memory and corrupt your sequence in a way that it returns duplicate values, or indeed that your database and all its backups are destroyed by a solar flare).
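To put a rough number on "so small", here is a back-of-the-envelope sketch using the birthday approximation. It assumes version 4 (random) UUIDs, which carry 122 random bits; the function name is made up for illustration.

```python
def uuid4_collision_probability(n: int) -> float:
    """Approximate chance of at least one collision among n random UUIDs.

    Birthday approximation p ~ n^2 / (2 * N), valid while p is small,
    with N = 2**122 possible values (assumption: version 4 UUIDs).
    """
    return n * n / (2 * 2**122)


# Probability of any collision after generating one billion UUIDs
p = uuid4_collision_probability(10**9)
print(f"{p:.1e}")
```

Even for a billion rows the result is on the order of 1e-19, which is why the collision risk can be ignored in practice.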
You can read my ruminations on the subject of sequences vs. UUIDs here; my opinion is that sequences are usually preferable.
If you use INSERT ... RETURNING id, you can retrieve the auto-generated identifier, and you can reuse it with CTEs:
WITH pi AS (
   INSERT INTO personal_info (...) VALUES (...)
   RETURNING id
), t AS (
   INSERT INTO trans (...) VALUES (...)
   RETURNING id
)
INSERT INTO pi_trans (pi_id, trans_id)
SELECT pi.id, t.id
FROM pi CROSS JOIN t;
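For comparison, if you do go with client-generated UUIDs, the parsing step from the question might look roughly like the sketch below. The column layout of the three row lists and the helper name build_rows are assumptions for illustration only; the rows would still need to be inserted with your database driver.

```python
import json
import uuid

raw = """
{
  "harry": {"transactions": {"desc": ["fish", "drinks", "potatoes"],
                             "amount": [32, 12, 35]},
            "country": "UK"},
  "james": {"transactions": {"desc": ["computer", "water", "table", "phone"],
                             "amount": [100, 32, 59, 99]},
            "country": "China"}
}
"""


def build_rows(data):
    """Flatten the nested document into rows for the three tables."""
    personal_info, trans, pi_trans = [], [], []
    for name, info in data.items():
        pi_id = str(uuid.uuid4())  # key generated on the client
        personal_info.append((pi_id, name, info["country"]))
        tx = info["transactions"]
        # "desc" and "amount" are parallel lists, so pair them up
        for desc, amount in zip(tx["desc"], tx["amount"]):
            trans_id = str(uuid.uuid4())
            trans.append((trans_id, desc, amount))
            pi_trans.append((pi_id, trans_id))  # link row
    return personal_info, trans, pi_trans


data = json.loads(raw)
personal_info, trans, pi_trans = build_rows(data)
print(len(personal_info), len(trans), len(pi_trans))
```

Because all keys are known before anything is sent to the database, the three tables can be loaded in bulk without reading identifiers back.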
So I have a DynamoDB database table which looks like this (exported to csv):
"email (S)","created_at (N)","firstName (S)","ip_addresses (L)","lastName (S)","updated_at (N)"
"name#email","1628546958.837838381","ddd","[ { ""M"" : { ""expiration"" : { ""N"" : ""1628806158"" }, ""IP"" : { ""S"" : ""127.0.0.1"" } } }]","ddd","1628546958.837940533"
I want to be able to do a "query", not a "scan", for all of the IPs (an attribute attached to users) which are expired. The time is stored as Unix time.
Right now I'm scanning the entire table, going through each user one by one, and then looping through all of their IPs to see whether they are expired. But I need to do this using a query, because scans are expensive.
The table layout is like this:
primaryKey = email
attributes = firstName, lastName, ip_addresses (array of {} maps where each map has IP, and Expiration as two keys).
I have no idea how to do this using a query so I would greatly appreciate if anyone could show me how! :)
I'm currently running the scan using python and boto3 like this:
import boto3

client = boto3.client('dynamodb')
response = client.scan(
    TableName='users',
    Select='SPECIFIC_ATTRIBUTES',
    AttributesToGet=[
        'ip_addresses',
    ])
As per the boto3 documentation, the Query operation finds items based on primary key values. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
Use the KeyConditionExpression parameter to provide a specific value for the partition key. The Query operation will return all of the items from the table or index with that partition key value. You can optionally narrow the scope of the Query operation by specifying a sort key value and a comparison operator in KeyConditionExpression. To further refine the Query results, you can optionally provide a FilterExpression. A FilterExpression determines which items within the results should be returned to you. All of the other results are discarded.
So, long story short, a query only fetches items whose partition key value you specify when running it.
A Query operation always returns a result set. If no matching items are found, the result set will be empty.
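As a rough illustration of what that means for the table in the question, the sketch below builds the arguments for a Query on a single user's email and then picks out the expired IPs on the client side. The helper names are made up, and the item shape is taken from the CSV export shown in the question; this does not find expired IPs across all users, since a Query always needs a concrete partition key value.

```python
import time


def build_user_query(email):
    """Keyword arguments for client.query(): one partition key, one attribute."""
    return {
        "TableName": "users",
        "KeyConditionExpression": "#e = :e",
        "ProjectionExpression": "#ips",
        # Placeholders avoid any clash with DynamoDB reserved words
        "ExpressionAttributeNames": {"#e": "email", "#ips": "ip_addresses"},
        "ExpressionAttributeValues": {":e": {"S": email}},
    }


def expired_ips(item, now=None):
    """Return the IPs of one low-level item whose expiration lies in the past."""
    now = time.time() if now is None else now
    entries = item.get("ip_addresses", {}).get("L", [])
    return [
        entry["M"]["IP"]["S"]
        for entry in entries
        if float(entry["M"]["expiration"]["N"]) < now
    ]


# Usage with a real client (not executed here):
#   response = client.query(**build_user_query("name#email"))
#   for item in response["Items"]:
#       print(expired_ips(item))
sample_item = {
    "ip_addresses": {
        "L": [{"M": {"expiration": {"N": "1628806158"}, "IP": {"S": "127.0.0.1"}}}]
    }
}
print(expired_ips(sample_item, now=1628806159))
```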
I am working on a system where I am storing data in DynamoDB and it has to be sorted chronologically. For partition_key I have an id (uuid) and for sort_key I have a date_created value. Now originally it was enough to save unique entries using only the ID, but then a problem arose that this data was not being sorted as I wanted, so a sort_key was added.
Using the Python boto3 library, it would be enough for me to get, update or delete items using only the id primary key, since I know that it is always unique:
import boto3
resource = boto3.resource('dynamodb')
table = resource.Table('my_table_name')
table.get_item(
    Key={'item_id': 'unique_item_id'}
)
table.update_item(
    Key={'item_id': 'unique_item_id'}
)
table.delete_item(
    Key={'item_id': 'unique_item_id'}
)
However, DynamoDB requires a sort key to be provided as well, since primary keys are composed of a partition key and a sort key.
table.get_item(
    Key={
        'item_id': 'unique_item_id',
        'date_created': 12345  # timestamp
    }
)
First of all, is it the right approach to use sort key to sort data chronologically or are there better approaches?
Secondly, what would be the best approach for transmitting partition key and sort key across the system? For example I have an API endpoint which accepts the ID, by this ID the backend performs a get_item query and returns the corresponding data. Now since I also need the sort key, I was thinking about using a hashing algorithm internally, where I would hash a JSON like this:
{
"item_id": "unique_item_id",
"date_created": 12345
}
and a single value then becomes my identifier for this database entry. I would then dehash this value before performing any database queries. Is this approach common?
First of all, is it the right approach to use sort key to sort data chronologically
Sort keys are the means of sorting data in DynamoDB. Using a timestamp as a sort key field is the right thing to do, and a common pattern in DDB.
DynamoDB requires a sort key to be provided ... since primary keys are composed partition key and sort key.
This is true. However, when reading from DDB it is possible to specify only the partition key using the query operation (as opposed to the get_item operation, which requires the full primary key). This is a powerful construct that lets you specify which items you want to read from a given partition.
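A minimal sketch of such a lookup, assuming `table` is a boto3 Table resource like the one in the question and that item_id is the partition key name. The helper name get_by_item_id is made up for illustration.

```python
def get_by_item_id(table, item_id):
    """Return the first item stored under the given partition key, or None.

    Only the partition key is supplied, so the sort key (date_created)
    does not need to be known by the caller.
    """
    response = table.query(
        KeyConditionExpression="item_id = :id",
        ExpressionAttributeValues={":id": item_id},
        Limit=1,
    )
    items = response.get("Items", [])
    return items[0] if items else None
```

Since each id is unique in your data, the query returns at most one item, and the returned item carries the date_created value you would need for a later update_item or delete_item call.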
You may want to look into KSUIDs for your unique identifiers. KSUIDs are like UUIDs, but they contain a time component. This allows them to be sorted by generation time. There are several KSUID libraries in Python, so you don't need to implement the algorithm yourself.
I am using PyMongo to access MongoDB. I want to search for all people near a specified location whose name contains a string. For example, I want to find all people near [105.0133, 21.3434] whose name contains 'Mark'. So I wrote the query like this:
db.users.find({ "location.coords": { "$nearSphere": [105.0133, 21.3434], "$maxDistance": 10/EARTH_RADIUS }, "name": "/Mark/" })
(I have an index "location.coords" in my "users" collection)
The query works fine in the MongoDB console, but when executed by PyMongo, the dictionary is reordered like this:
{ "name": "/Mark/", "location.coords": { "$nearSphere": [105.0133, 21.3434], "$maxDistance": 10/EARTH_RADIUS } }
(The "name" key comes before "location.coords", which is not what I expected, nor what MongoDB expects.)
As a result, MongoDB cannot understand the query and returns no results. Can anyone help me figure out how to stop PyMongo from reordering my dictionary?
Thanks and regards
The dictionary type is inherently orderless (at least before Python 3.7; from 3.7 on, dicts preserve insertion order). From the Python documentation:
It is best to think of a dictionary as an unordered set of key: value
pairs, with the requirement that the keys are unique (within one
dictionary).
If you want to index your dictionary in a specific order, you'll have to store your order somehow. One easy way to do this is to keep your keys in a list, like:
mongo_keys = ["location.coords", "name"]
for k in mongo_keys:
    do_something(mongo_result[k])
You also might want to investigate:
class collections.OrderedDict([items])
Return an instance of a dict
subclass, supporting the usual dict methods. An OrderedDict is a dict
that remembers the order that keys were first inserted. If a new entry
overwrites an existing entry, the original insertion position is left
unchanged. Deleting an entry and reinserting it will move it to the
end.
Unfortunately if you need more help than that, you'll need to provide more details of your situation.
The issue isn't the ordering, it's "/Mark/". The notation with forward slashes is a convenience provided by the JavaScript shell, and doesn't constitute a part of the regular expression pattern itself (unless you meant for them to be literal slashes, in which case I've misunderstood your question).
To use a regular expression ("contains") filter in PyMongo, you need to pass a Python regular expression object. Try this:
{ "name": re.compile("Mark"), "location.coords": { "$nearSphere": [105.0133, 21.3434], "$maxDistance": 10/EARTH_RADIUS } }
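Spelled out as a self-contained sketch: the query dict can be built and inspected without a database connection. The value of EARTH_RADIUS is an assumption (kilometres), since the question does not show how it is defined.

```python
import re

EARTH_RADIUS = 6371.0  # assumed: kilometres, so 10 / EARTH_RADIUS is ~10 km in radians

query = {
    "location.coords": {
        "$nearSphere": [105.0133, 21.3434],
        "$maxDistance": 10 / EARTH_RADIUS,
    },
    # A compiled pattern, not the string "/Mark/"
    "name": re.compile("Mark"),
}

# With a real connection (not executed here):
#   results = db.users.find(query)

print(list(query))                                  # key order as written (Python 3.7+)
print(bool(query["name"].search("John Markson")))   # the pattern matches substrings
```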
Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.
I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
I don't want to have duplicate documents.
I don't want to remove a document which has previously been saved, but is not in my update.
95% (estimated) of the records are unmodified from day to day.
I am using the Python driver (pymongo).
What I currently do is (pseudo-code):
for each document in update:
    existing_document = collection.find_one(document)
    if not existing_document:
        document['insertion_date'] = now
    else:
        document = existing_document
    document['last_update_date'] = now
    collection.save(document)
My problem is that it is very slow (40 mins for less than 100 000 records, and I have millions of them in the update).
I am pretty sure there is something built in for doing this, but the documentation for update() is mmmhhh.... a bit terse.... (http://www.mongodb.org/display/DOCS/Updating )
Can someone advise how to do it faster?
Sounds like you want to do an upsert. MongoDB has built-in support for this. Pass an extra parameter to your update() call: {upsert:true}. For example:
key = {'key': 'value'}
data = {'key2': 'value2', 'key3': 'value3'}
# In Python, upsert must be passed as a keyword argument.
# Note: update() was removed in PyMongo 4; use update_one() or replace_one() there.
coll.update(key, data, upsert=True)
This replaces your if-find-else-update block entirely. It will insert if the key doesn't exist and will update if it does.
Before:
{"key":"value", "key2":"Ohai."}
After:
{"key":"value", "key2":"value2", "key3":"value3"}
You can also specify what data you want to write:
data = {"$set":{"key2":"value2"}}
Now your selected document will update the value of key2 only and leave everything else untouched.
As of MongoDB 2.4, you can use $setOnInsert (http://docs.mongodb.org/manual/reference/operator/setOnInsert/)
Set insertion_date using $setOnInsert and last_update_date using $set in your upsert command.
To turn your pseudocode into a working example:
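from datetime import datetime

now = datetime.utcnow()
for document in update:
    collection.update_one(
        filter={
            '_id': document['_id'],
        },
        update={
            # Written only when the upsert inserts a new document
            '$setOnInsert': {
                'insertion_date': now,
            },
            # Written on every run, for new and existing documents
            '$set': {
                'last_update_date': now,
            },
        },
        upsert=True,
    )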
You could always make a unique index, which causes MongoDB to reject a conflicting save. Consider the following done using the mongodb shell:
> db.getCollection("test").insert ({a:1, b:2, c:3})
> db.getCollection("test").find()
{ "_id" : ObjectId("50c8e35adde18a44f284e7ac"), "a" : 1, "b" : 2, "c" : 3 }
> db.getCollection("test").ensureIndex ({"a" : 1}, {unique: true})
> db.getCollection("test").insert({a:2, b:12, c:13}) # This works
> db.getCollection("test").insert({a:1, b:12, c:13}) # This fails
E11000 duplicate key error index: foo.test.$a_1 dup key: { : 1.0 }
You may use Upsert with $setOnInsert operator.
db.Table.update({noExist: true}, {"$setOnInsert": {xxxYourDocumentxxx}}, {upsert: true})
Summary
You have an existing collection of records.
You have a set of records that contain updates to the existing records.
Some of the updates don't really update anything, they duplicate what you have already.
All updates contain the same fields that are there already, just possibly different values.
You want to track when a record was last changed, where a value actually changed.
Note, I'm presuming PyMongo, change to suit your language of choice.
Instructions:
Create the collection with an index with unique=true so you don't get duplicate records.
Iterate over your input records, creating batches of them of 15,000 records or so. For each record in the batch, create a dict consisting of the data you want to insert, presuming each one is going to be a new record. Add the 'created' and 'updated' timestamps to these. Issue this as a batch insert command with the 'ContinueOnError' flag=true, so the insert of everything else happens even if there's a duplicate key in there (which it sounds like there will be). THIS WILL HAPPEN VERY FAST. Bulk inserts rock, I've gotten 15k/second performance levels. Further notes on ContinueOnError, see http://docs.mongodb.org/manual/core/write-operations/
Record inserts happen VERY fast, so you'll be done with those inserts in no time. Now, it's time to update the relevant records. Do this with a batch retrieval, much faster than one at a time.
Iterate over all your input records again, creating batches of 15K or so. Extract the keys (best if there's one key, but can't be helped if there isn't). Retrieve this bunch of records from Mongo with a db.collectionNameBlah.find({ field : { $in : [ 1, 2, 3, ... ] } }) query. For each of these records, determine if there's an update, and if so, issue the update, including updating the 'updated' timestamp.
Unfortunately, we should note, MongoDB 2.4 and below do NOT include a bulk update operation. They're working on that.
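The batching and the batch retrieval from the instructions above can be sketched as follows. The helper names and the default key field name "key" are made up for illustration, and `collection` is assumed to be a PyMongo collection (only its find() method is used). The bulk insert and the individual updates themselves are left out.

```python
from itertools import islice


def batches(records, size=15000):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def records_needing_update(collection, batch, key="key"):
    """Fetch the stored versions of one batch with a single $in query and
    return the input records whose values differ from what is stored."""
    wanted = [record[key] for record in batch]
    stored = {doc[key]: doc for doc in collection.find({key: {"$in": wanted}})}
    changed = []
    for record in batch:
        old = stored.get(record[key])
        # Skip records that are not stored yet or are identical to the stored copy
        if old is not None and any(old.get(f) != v for f, v in record.items()):
            changed.append(record)
    return changed
```

Only the records returned by records_needing_update would then get an individual update that also refreshes the 'updated' timestamp.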
Key Optimization Points:
The inserts will vastly speed up your operations in bulk.
Retrieving records en masse will speed things up, too.
Individual updates are the only possible route now, but 10Gen is working on it. Presumably, this will be in 2.6, though I'm not sure if it will be finished by then, there's a lot of stuff to do (I've been following their Jira system).
I don't think mongodb supports this type of selective upserting. I have the same problem as LeMiz, and using update(criteria, newObj, upsert, multi) doesn't work right when dealing with both a 'created' and 'updated' timestamp. Given the following upsert statement:
update( { "name": "abc" },
{ $set: { "created": "2010-07-14 11:11:11",
"updated": "2010-07-14 11:11:11" }},
true, true )
Scenario #1 - document with 'name' of 'abc' does not exist:
New document is created with 'name' = 'abc', 'created' = 2010-07-14 11:11:11, and 'updated' = 2010-07-14 11:11:11.
Scenario #2 - document with 'name' of 'abc' already exists with the following:
'name' = 'abc', 'created' = 2010-07-12 09:09:09, and 'updated' = 2010-07-13 10:10:10.
After the upsert, the document would now be the same as the result in scenario #1. There's no way to specify in an upsert which fields be set if inserting, and which fields be left alone if updating.
My solution was to create a unique index on the criteria fields, perform an insert, and immediately afterward perform an update just on the 'updated' field.
1. Use Update.
Drawing from Van Nguyen's answer above, use update instead of save. This gives you access to the upsert option.
NOTE: This method overrides the entire document when found (From the docs)
var conditions = { name: 'borne' }, update = { $inc: { visits: 1 } }, options = { multi: true };
Model.update(conditions, update, options, callback);
function callback (err, numAffected) { /* numAffected is the number of updated documents */ }
1.a. Use $set
If you want to update a selection of the document, but not the whole thing, you can use the $set method with update. (again, From the docs)...
So, if you want to set...
var query = { name: 'borne' }; Model.update(query, { name: 'jason borne' }, options, callback)
Send it as...
Model.update(query, { $set: { name: 'jason borne' } }, options, callback)
This helps prevent accidentally overwriting all of your document(s) with { name: 'jason borne' }.
In general, using update is better in MongoDB as it will just create the document if it doesn't exist yet, though I'm not sure how to work that with your python adapter.
Second, if you only need to know whether or not that document exists, count(), which returns only a number, will be a better option than find_one, which transfers the whole document from MongoDB and causes unnecessary traffic.
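A minimal sketch of such an existence check, assuming PyMongo 3.7 or later, where count_documents() replaces the older count(). The helper name document_exists is made up for illustration, and `collection` is assumed to be a PyMongo collection.

```python
def document_exists(collection, query):
    """Return True if at least one document matches the query.

    limit=1 lets the server stop counting after the first match,
    and only a number is sent back, never the document itself.
    """
    return collection.count_documents(query, limit=1) > 0
```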
Method For Pymongo
The Official MongoDB Driver for Python
About 5% of the time you will want to update and overwrite, while the rest of the time you will want to insert a new row. Both cases are handled by updateOne with upsert.
95% (estimated) of the records are unmodified from day to day.
The following solution is taken from this core mongoDB function:
db.collection.updateOne(filter, update, options)
Updates a single document within the collection based on the filter.
In PyMongo, this is done with update_one(filter, new_values, upsert=True).
Code Example:
# importing pymongo's MongoClient
from pymongo import MongoClient
conn = MongoClient('localhost', 27017)
db = conn.databaseName
# Select the document by user_id and question_id
filter = { 'user_id': '4142480', 'question_id': '2801008' }
# Set the votes field to 1400, leaving other fields untouched
new_values = { "$set": { 'votes': 1400 } }
# Using update_one() method for single update with upsert.
db.collectionName.update_one(filter, new_values, upsert=True)
What does upsert=True do?
Creates a new document if no documents match the filter.
Updates a single document that matches the filter.
I would suggest using await now.