Does UUIDField's 'default' attribute take care of uniqueness? - python

I just jumped into Django for a quick project and noticed there is a UUIDField in the models.
I am using this for an external id field that every model will have, in order to expose the object. Will the default parameter handle uniqueness, or do I have to enforce it in save()? I know there is practically no chance of values colliding; I just want to know how it is handled internally.

How does the UUID module guarantee unique values each time?
RFC 4122 (the UUID specification) defines three mechanisms for generating UUIDs:
Using IEEE 802 MAC addresses as a source of uniqueness
Using pseudo-random numbers
Using well-known strings combined with cryptographic hashing
In all cases the seed value is combined with the system clock and a clock sequence value (to maintain uniqueness in case the clock was set backwards). As a result, the UUIDs generated according to the mechanisms above will be unique from all other UUIDs that have been or will be assigned.
Taken from RFC 4122 Abstract:
A UUID is 128 bits long, and can guarantee uniqueness across space and time.
Note: Because of this uniqueness property of UUIDs, Django does not internally check (as mentioned by @FlipperPA) whether another object with the same UUID already exists.
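For reference, here is a quick illustration (not specific to Django) of how those mechanisms map onto Python's standard uuid module:
import uuid

print(uuid.uuid1())                                   # version 1: MAC address + timestamp + clock sequence
print(uuid.uuid3(uuid.NAMESPACE_DNS, "example.com"))  # version 3: MD5 hash of a namespace + name
print(uuid.uuid5(uuid.NAMESPACE_DNS, "example.com"))  # version 5: SHA-1 hash of a namespace + name
print(uuid.uuid4())                                   # version 4: (pseudo-)random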

Django doesn't enforce the uniqueness of UUIDs. That's because the main use case for UUIDs is to provide an identifier that can be expected to be unique without having to check with a centralized authority (like a database, which is what unique=True does).
(Note that UUIDs are not guaranteed to be unique; there is just an astronomically small chance of a collision.)
You certainly can use the database to enforce uniqueness on top of the UUIDs if you want (by setting unique=True on your field), but I would say that's an unusual, and hard to justify, configuration.
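For illustration, a typical field definition would look something like this (a minimal sketch; the model and field names are made up). Note that the callable uuid.uuid4 is passed rather than called, and unique=True is what asks the database to enforce uniqueness:
import uuid
from django.db import models

class Thing(models.Model):
    # default generates a fresh UUID per row; unique=True adds a DB constraint
    external_id = models.UUIDField(default=uuid.uuid4, unique=True, editable=False)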

No, it does not. Here is the relevant part of the code from Django:
def get_db_prep_value(self, value, connection, prepared=False):
    if isinstance(value, six.string_types):
        value = uuid.UUID(value.replace('-', ''))
    if isinstance(value, uuid.UUID):
        if connection.features.has_native_uuid_field:
            return value
        return value.hex
    return value
As you can see, when preparing the value for the database, it simply builds a uuid.UUID from the string (stripping hyphens); there is no check for an existing duplicate. That said, UUIDField inherits from Field, so it will honour Django's unique option on the model field if you set it.

I am fond of using UUIDs as primary keys, and I am also fond of not delivering 500 errors to end users for a simple operation such as creating a login. So I have the following classmethod on my model. I also siphon off some pre-assigned reserved guids for synthetic transactions on the production database and don't want those colliding either. Cosmic lightning has struck before: a variant of the code below (instrumented to report collisions) has actually fired its second attempt at guid assignment. The code shown still risks a concurrent write collision from a different app server, so my views fall back to this method if a write/create operation fails.
I acknowledge that this code is slower by the cost of the extra database lookup, but since the guid is my primary key it is not ridiculously expensive when the underlying database uses a B-tree index on the field.
@classmethod
def attempt_to_set_guid(cls, attempted_guid=None):
    # Assumes: from uuid import uuid4
    while True:
        if attempted_guid is None or attempted_guid in cls.reserved_guids:
            attempted_guid = uuid4()
        try:
            # A hit means this guid is already taken; try a fresh one.
            cls.objects.get(guid=attempted_guid)
            attempted_guid = uuid4()
        except cls.DoesNotExist:
            # No row with this guid exists yet, so it is safe to use.
            break
    return attempted_guid
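For context, a hypothetical view-side usage along the lines described above (the Guid model name comes from the code; everything else is illustrative):
from django.db import IntegrityError

def create_guid_record():
    try:
        return Guid.objects.create(guid=Guid.attempt_to_set_guid())
    except IntegrityError:
        # A concurrent writer on another app server claimed the guid first;
        # fall back to the method once more, as described above.
        return Guid.objects.create(guid=Guid.attempt_to_set_guid())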

Related

Is there a way to error on related queries in Django ORM?

I have a Django model backed by a very large table (Log) containing millions of rows. This model has a foreign key reference to a much smaller table (Host). Example models:
class Host(Model):
    name = CharField()

class Log(Model):
    value = CharField()
    host = ForeignKey(Host)
In reality there are many more fields and also more foreign keys similar to Log.host.
There is an iter_logs() function that efficiently iterates over Log records using a paginated query scheme. Other places in the program use iter_logs() to process large volumes of Log records, doing things like dumping to a file, etc.
For efficient operation, any code that uses iter_logs() should only access fields like value. But problems arise when someone innocently accesses log.host. In this case Django will issue a separate query each time a new Log record's host is accessed, killing the performance of the efficient paginated query in iter_logs().
I know I can use select_related to efficiently fetch the related host records, but all known uses of iter_logs() should not need this, so it would be wasteful. If a use case for accessing log.host did arise in this context I would want to add a parameter to iter_logs() to optionally use .select_related("host"), but that has not become necessary yet.
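To make the failure mode concrete, here is a rough sketch (iter_logs() and the field names come from the question; the rest is illustrative):
for log in iter_logs():
    print(log.value)      # fine: already loaded by the paginated query
    print(log.host.name)  # issues one extra SELECT per Log row (N+1 queries)

# If related access were ever needed, an opt-in parameter such as
# iter_logs(select_related=("host",)) could apply .select_related("host")
# to the underlying queryset instead of paying that cost everywhere.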
I am looking for a way to tell the underlying query logic in Django to never perform additional database queries except those explicitly allowed in iter_logs(). If such a query becomes necessary it should raise an error instead. Is there a way to do that with Django?
One solution I'd prefer to avoid: wrap or otherwise modify objects yielded by iter_logs() to prevent access to foreign keys.
More generally, Django's deferred query logic breaks encapsulation of code that constructs queries. Dependent code must know about the implementation or risk imposing major inefficiencies. This is usually fine at small scale where a little inefficiency does not matter, but becomes a real problem at larger scale. An early error would be much better because it would be easy to detect in small-scale tests rather than deferring the problem to production run time where it manifests as general slowness.

Auto increment property with py2neo (Neo4j)?

I'm using Flask with py2neo for my REST service. I have a user node with the label "User".
How do I auto-increment an id for the "User" label in Neo4j using py2neo?
You don't, and you probably shouldn't. Neo4j already provides an internal id field that is an auto-incrementing integer. It isn't a property of the node, but is accessible via the id() function, like this:
MATCH (n:Person)
RETURN id(n);
So whenever you create a node, this already happens automatically for free in Neo4j; it isn't done by py2neo.
If you need a different type of identifier for your code, I'd recommend something that's plausibly globally unique, like a UUID which is very easy to do in python, rather than an auto-incrementing integer.
The trouble with auto-incrementing numbers as IDs is that, because they follow a pattern (auto-incrementing), people come to rely on the value of the identifier, or on expectations of how the ID will be assigned. This is almost always a bad idea in databases. The sole purpose of the identifier is to be distinct from everything else. It doesn't mean anything, and in some cases it isn't even guaranteed not to change. Avoid building any reliance on a particular value or assignment scheme into your code.
That's why I like UUIDs: their assignment scheme is essentially arbitrary, and they clearly don't mean anything, so they don't tempt designers to do anything clever with them. :)
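A minimal sketch of that approach with py2neo (the connection details and property names are illustrative):
import uuid
from py2neo import Graph, Node

graph = Graph()  # assumes a local Neo4j instance; adjust connection as needed

# Store the application-level identifier as an ordinary string property.
user = Node("User", name="alice", uuid=str(uuid.uuid4()))
graph.create(user)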

Google app engine: better way to make query

Say I have RootEntity, AEntity(child of RootEntity), BEntity(child of AEntity).
class RootEntity(ndb.Model):
    rtp = ndb.StringProperty()

class AEntity(ndb.Model):
    ap = ndb.IntegerProperty()

class BEntity(ndb.Model):
    bp = ndb.StringProperty()
So in different handlers I need to get instances of BEntity with a specific ancestor (an instance of AEntity).
Here is my query:
BEntity.query(ancestor=ndb.Key(
    "RootEntity", 1,
    "AEntity", AEntity.query(ancestor=ndb.Key("RootEntity", 1))
                      .filter(AEntity.ap == int(some_value))
                      .get().key.integer_id()))
How can I optimize this query? Is there a way to make it better, maybe simpler?
Update:
This query is part of a function with the @ndb.transactional decorator.
You should not use Entity Groups to represent entity relationships.
Entity groups have a special purpose: to define the scope of transactions. They give you the ability to update multiple entities transactionally, as long as they are part of the same entity group (this limitation has been somewhat relaxed with the new XG transactions). They also allow you to use queries within transactions (not available with XG transactions).
The downside of entity groups is that they are limited to roughly 1 write per second.
In your case my suggestion would be to use separate entities and make references between them. The reference should be a Key of the referenced entity as this is type-safe.
Regarding query simplicity: GAE unfortunately does not support JOINs or reference (multi-entity) queries, so you would still need to combine multiple queries together (as you do now).
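A rough sketch of that restructuring with KeyProperty references (the property names are illustrative, not from the question):
from google.appengine.ext import ndb

class AEntity(ndb.Model):
    root = ndb.KeyProperty(kind="RootEntity")  # reference instead of ancestor
    ap = ndb.IntegerProperty()

class BEntity(ndb.Model):
    parent_a = ndb.KeyProperty(kind="AEntity")
    bp = ndb.StringProperty()

# Two flat queries replace the nested ancestor query:
a = AEntity.query(AEntity.ap == int(some_value)).get()
b_entities = BEntity.query(BEntity.parent_a == a.key).fetch()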
There is a give and take with ancestor queries. They are more verbose and messier to deal with, but you get better structure for your data and consistency in your queries.
To simplify this, if your handler knows the BEntity you want to get, just pass around the key.urlsafe() encoded key, it already has all of your ancestor information encoded.
If this is not possible, try possibly restructuring your data. Since these objects are all of the same ancestor, they belong to the same entity group, thus at most you can insert/update ~1 time per second for objects in that entity group. If you require higher throughput or do not require consistent ancestral queries, then try using ndb.KeyProperty to link entities with a reference to a parent rather than as an ancestor. Then you'd only need to get a single parent to query on rather than the parent and the parent's parent.
You should also try to use IDs whenever possible, so you can avoid having to filter for entities in your datastore by property values and can just reference them by ID:
BEntity.query(ancestor=ndb.Key("RootEntity", 1, "AEntity", int(some_value)))
Here, int(some_value) is the integer ID of the AEntity you used when you created that object. Just be sure the IDs you manually create/use are unique across all instances of that Model that share the same parent.
EDIT:
To clarify, my last example should have been clearer: I was suggesting restructuring the data so that int(some_value) is used as the integer ID of the AEntity, rather than stored as a separate property of the entity, if possible of course. In the example given, a query is performed for the AEntity objects that have a given integer field value of int(some_value) and is executed with get(), implying that you always expect a single result for that value. That makes it a good candidate to use as the integer ID in the key of that object, eliminating the need for the query.
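In code, the suggested restructuring would look roughly like this (assuming int(some_value) is known when the AEntity is created):
# Create the AEntity with int(some_value) as its ID, under RootEntity 1.
a_key = ndb.Key("RootEntity", 1, "AEntity", int(some_value))
AEntity(key=a_key, ap=int(some_value)).put()

# Later, no property query is needed at all:
b_entities = BEntity.query(ancestor=a_key).fetch()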

Using Python's UUID to generate unique IDs, should I still check for duplicates?

I'm using Python's UUID function to create unique IDs for objects to be stored in a database:
>>> import uuid
>>> print uuid.uuid4()
2eec67d5-450a-48d4-a92f-e387530b1b8b
Is it ok to assume that this is indeed a unique ID?
Or should I double-check against my database that this ID has not already been generated before accepting it as valid?
I would use uuid1, which has essentially zero chance of collisions, since it takes the date/time into account when generating the UUID (unless you are generating a great number of UUIDs at the same moment).
You can actually reverse the UUID1 value to retrieve the original epoch time that was used to generate it.
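For example, a quick sketch of that reversal with the standard library (a version 1 UUID stores a 60-bit count of 100-nanosecond intervals since 1582-10-15):
import datetime
import uuid

u = uuid.uuid1()
generated_at = datetime.datetime(1582, 10, 15) + datetime.timedelta(
    microseconds=u.time // 10)
print(generated_at)  # roughly the moment the UUID was generated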
uuid4 generates a random ID that has a very small chance of colliding with a previously generated value. However, since it doesn't use a monotonically increasing timestamp as an input (or include one in the output UUID), a value that was previously generated has a (very) small chance of being generated again in the future.
You should always have a duplicate check; even though the odds are very much in your favour, you can still get duplicates.
I would recommend just adding a unique key constraint in your database and retrying in case of an error.
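A rough sketch of that constraint-plus-retry idea, using sqlite3 purely for illustration (the table and column names are made up):
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id TEXT PRIMARY KEY, payload TEXT)")

def insert_object(payload, retries=3):
    for _ in range(retries):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("INSERT INTO objects (id, payload) VALUES (?, ?)",
                             (str(uuid.uuid4()), payload))
            return
        except sqlite3.IntegrityError:
            continue  # astronomically unlikely collision; try a new UUID
    raise RuntimeError("could not insert a row with a unique id")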
As long as you create all UUIDs on the same system, then unless there is a very serious flaw in the Python implementation (which I really cannot imagine), RFC 4122 says they will all be distinct (edit: if using version 1, 3, or 5).
The only situation in which a collision could arise is if two systems create a UUID at exactly the same moment and:
use the same MAC address on their network card (really uncommon) and you are using UUID version 1,
or use the same name and you are using UUID version 3 or 5,
or get the same random number and you are using UUID version 4 (*).
So if you have a real MAC address, or use an official DNS name or a unique LDAP DN, you can take it as given that the generated UUIDs will be globally unique.
So IMHO, you only have to check uniqueness if you want to protect your application against a malicious attack that deliberately reuses an existing UUID.
EDIT:
As stated by Martin Konecny, in uuid4 the timestamp part is random too, not monotonic. So the possibility of a collision is very small, but not zero.

sqlalchemy id equality vs reference equality

I'm working with SQLAlchemy for the first time and was wondering: generally speaking, is it enough to rely on Python's default equality semantics when working with SQLAlchemy, versus id (primary key) equality?
In other projects I've worked on in the past using ORM technologies like Java's Hibernate, we'd always override .equals() to check for equality of an object's primary key/id, but when I look back I'm not sure this was always necessary.
In most, if not all, cases I can think of, you only ever had one reference to a given object with a given id, and that object was always the attached object, so technically you could get away with reference equality.
Short question: Should I be overriding __eq__() and __hash__() for my business entities when using SQLAlchemy?
Short answer: No, unless you're working with multiple Session objects.
Longer answer, quoting the awesome documentation:
The ORM concept at work here is known as an identity map and ensures that all operations upon a particular row within a Session operate upon the same set of data. Once an object with a particular primary key is present in the Session, all SQL queries on that Session will always return the same Python object for that particular primary key; it also will raise an error if an attempt is made to place a second, already-persisted object with the same primary key within the session.
I had a few situations where my SQLAlchemy application would load multiple instances of the same object (multithreading / different SQLAlchemy sessions...). It was absolutely necessary to override __eq__() for those objects, or I would get various problems. This could be a problem in my application design, but it probably doesn't hurt to override __eq__() just to be sure.
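If you do end up needing it, a minimal sketch of a primary-key based override might look like this (the class and column names are illustrative; SQLAlchemy 1.4+ style):
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

    def __eq__(self, other):
        # Two User instances are "equal" if they share a persisted primary key.
        return isinstance(other, User) and self.id is not None and self.id == other.id

    def __hash__(self):
        return hash(self.id)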
