I see quite a few implementations of unique string generation for things like uploaded image names, session IDs, and so on, and many of them use hashes like SHA1 or similar.
I'm not questioning the legitimacy of using custom methods like this, but rather just the reason. If I want a unique string, I just say this:
>>> import uuid
>>> uuid.uuid4()
UUID('07033084-5cfd-4812-90a4-e4d24ffb6e3d')
And I'm done with it. I wasn't very trusting before I read up on uuid, so I did this:
>>> import uuid
>>> s = set()
>>> for i in range(5000000):  # that's 5 million!
...     s.add(str(uuid.uuid4()))
...
>>> len(s)
5000000
Not one repeat (I wouldn't expect one, considering the odds of a collision are something like 1 in 1.108e+50, but it's comforting to see it in action). You could make a collision even less likely by combining two uuid4()s into one string.
So, with that said, why do people spend time on random() and other approaches for unique strings? Is there an important security issue or some other concern with uuid?
Using a hash to uniquely identify a resource allows you to generate a 'unique' reference from the object. For instance, Git uses SHA hashing to make a unique hash that represents the exact changeset of a single commit. Since hashing is deterministic, you'll get the same hash for the same file every time.
Two people across the world could make the same change to the same repo independently, and Git would know they made the same change. UUID v1, v2, and v4 can't support that since they have no relation to the file or the file's contents.
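For illustration, here is a minimal sketch of that determinism using Python's hashlib (Git's real object format also hashes a small header before the content, so this is not Git's exact hash):
import hashlib

data = b'the exact same file contents'
# The digest depends only on the bytes, so two people hashing the same
# content independently get the same identifier.
print(hashlib.sha1(data).hexdigest())
print(hashlib.sha1(data).hexdigest())  # prints the identical digest again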
Well, sometimes you want collisions. If someone uploads the same exact image twice, maybe you'd rather tell them it's a duplicate rather than just make another copy with a new name.
One possible reason is that you want the unique string to be human-readable. UUIDs just aren't easy to read.
UUIDs are long and meaningless (for instance, if you order by UUID, you get a meaningless result).
And because they're so long, I wouldn't want to put one in a URL or expose it to the user in any shape or form.
In addition to the other answers, hashes are really good for things that should be immutable. The name is unique and can be used to check the integrity of whatever it is attached to at any time.
Also note that other kinds of UUID might be more appropriate. For example, if you want your identifiers to be orderable, UUID1 is based in part on a timestamp. It's all really about your application's requirements...
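As a rough sketch, you can see the timestamp embedded in a UUID1 via its .time attribute (a count of 100-nanosecond intervals since 1582-10-15), which makes IDs generated later compare as greater:
import uuid

a = uuid.uuid1()
b = uuid.uuid1()
# The embedded timestamp increases over time, so later UUIDs have a larger .time.
print(a.time < b.time)  # True (when generated in this order on one machine)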
Related
My idea is to create a hash of a queryset result. For example, product inventory.
Each update of this stock would generate a hash.
The idea is that the API would only be re-requested for this queryset when something changes (for example, a new product in inventory).
Example for this use:
no change, same hash: no request to fetch the queryset
there was a change, different hash: a request will be made
This would be a feature for those consuming the data, not for the Django app serving it.
Does this make any sense? I saw that in Python there is a way to generate a hash from a tuple; in my case it would be to use a frozenset and hash that. I don't know if it's a good idea.
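For what it's worth, a minimal sketch of the idea might look like the following; the Product model and its fields are made up, and values_list() gives tuples you can put in a frozenset:
# Hypothetical Django model and fields, just to illustrate the idea.
rows = Product.objects.values_list('id', 'name', 'stock')
snapshot = frozenset(rows)   # order-insensitive and hashable
version = hash(snapshot)     # changes whenever any row changes
# Caveat: Python's built-in hash() is randomized per process for strings,
# so this value is only comparable within a single running process.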
I would comment, but I'm waiting on the 50 rep to be able to do that. It sounds like you're trying to cache results so you aren't querying on data that hasn't been changed. If you're not familiar with caching, the idea is to save hard-to-compute answers in memory for frequently queried endpoints/functions.
For example, if I had a program that calculated the first n digits of pi, I may choose to save a map of [digit count -> value] so that if 10 people asked me for the first thousand, I would only calculate it once. Redis is a popular option for caching, and I believe it exists for Django. It allows you to cache some information, set a time before expiration on it, and then wipe specific parts of that information (to force it to recalculate) every time something specific changes (like a new product in inventory).
Everybody should try writing their own cache at least once, like what you're describing, but the de facto professional option is to use a caching library. Your idea is good and it will definitely work; you will probably want a dict of [hash -> result], where result is the information you would send back over your API. If you plan to save the data so it persists across multiple program starts, remember that Python randomizes the seed of its built-in hash() per process, so hash values are not consistent across runs. Check out this post for more info.
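Because of that hash randomization, a sketch like the one below uses hashlib instead of the built-in hash() so the digest stays stable across runs; inventory_digest and build_response are made-up names for illustration:
import hashlib

def inventory_digest(rows):
    # rows: an iterable of tuples; sort them for a canonical order,
    # then hash the serialized form with hashlib for a process-stable digest.
    payload = repr(sorted(rows)).encode('utf-8')
    return hashlib.md5(payload).hexdigest()

cache = {}  # digest -> response already computed for that inventory state

def cached_response(rows):
    key = inventory_digest(rows)
    if key not in cache:
        cache[key] = build_response(rows)  # build_response is your own serializer
    return key, cache[key]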
I'm trying to reduce the size of a string like this:
'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpYXQiOjE0NDU0OTk3NDUsImQiOnsiYXV0aF9kYXRhIjoiZm9vIiwib3RoZXJfYXV0aF9kYXRhIjoiYmFyIiwidWlkIjoidW5pcXVlSWQxIn0sInYiOjB9.h6LV3boj0ka2PsyOjZJb8Q48ugiHlEkNksusRGtcUBk'
to something that someone could type in less than 30 seconds, like this:
'aF9kYX'
and be able to turn it back to the original string too. How could I achieve that?
EDIT: I guess I'm not being clear; first of all, I don't know if what I want is even possible.
So, I have my app which asks for a token to log in, which is that JWT. But it is way too long for someone to type manually. So I supposed there was an algorithm to make this string smaller (compress it) so that it would be easier and faster to type. An example that comes to mind of how I would use such an algorithm is:
short_to_big(small_string)   # returns the original JWT
big_to_short(JWT_string)     # returns the smaller string
Stupid simple answer: use a dict to store the short string as key and the long one as value. Then you just have to generate the short string the way you like and make sure it's not already in the dict. If you need to persist the key/value, you can use almost any kind of database (sql, key:value, document, or even a csv file FWIW).
Oh and if that doesn't solve your problem then you may want to consider giving more context ;)
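A minimal sketch of that dict-based approach (the code-generation details here are just one possibility):
import secrets
import string

token_by_code = {}  # short code -> original long string

def big_to_short(long_string, length=6):
    alphabet = string.ascii_letters + string.digits
    while True:
        code = ''.join(secrets.choice(alphabet) for _ in range(length))
        if code not in token_by_code:   # regenerate on the rare clash
            token_by_code[code] = long_string
            return code

def short_to_big(code):
    return token_by_code[code]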
You need more constraints. A 200-character string contains a lot more information than a 6-character string, so you either need to know a lot more about the original strings (e.g. that they come from some known set, or use a limited character set) or you need to store the original strings somewhere and use the string the user types as a key into a map or similar.
There are lossless compression algorithms, but these depend on knowing some probabilistic information about the string (e.g. that repeated characters are likely) and will typically expand the strings if the probabilities are wrong.
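You can see how little lossless compression buys you on a string like the JWT from the question with a quick check:
import zlib

jwt = ('eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.'
       'eyJpYXQiOjE0NDU0OTk3NDUsImQiOnsiYXV0aF9kYXRhIjoiZm9vIiwib3RoZXJfYXV0aF9kYXRhIjoiYmFyIiwidWlkIjoidW5pcXVlSWQxIn0sInYiOjB9.'
       'h6LV3boj0ka2PsyOjZJb8Q48ugiHlEkNksusRGtcUBk')

compressed = zlib.compress(jwt.encode('ascii'), 9)
print(len(jwt), len(compressed))
# The compressed form is still nowhere near 6 characters, and it would need to be
# re-encoded (e.g. base64) before anyone could type it at all.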
UPDATE (After question clarification and comments suggestion)
You could implement an algorithm that maps this big string to a short representation and store the mapping in a dictionary. The following approach does not guarantee uniqueness, but it should give you a path to follow.
import random
import string

def long_string_to_short(original_string, length=10):
    # Seeding the PRNG with the original string means the same input
    # always maps to the same short string.
    random.seed(original_string)
    filling_values = string.digits + string.ascii_letters
    short_string = ''.join(random.choice(filling_values) for _ in range(length))
    return short_string
When calling the function you can specify an appropriate length for the short string.
Then you could:
my_mapping_dict = {}
my_long_string = 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpYXQiOjE0NDU0OTk3NDUsImQiOnsiYXV0aF9kYXRhIjoiZm9vIiwib3RoZXJfYXV0aF9kYXRhIjoiYmFyIiwidWlkIjoidW5pcXVlSWQxIn0sInYiOjB9.h6LV3boj0ka2PsyOjZJb8Q48ugiHlEkNksusRGtcUBk'
short_string = long_string_to_short(my_long_string)
my_mapping_dict[short_string] = my_long_string
OK, so because I couldn't find a way to shrink the string, I took a different approach and found a solution.
Now to clarify why I wanted to log in with the token, I'm going to write what I want to do with my app:
In Firebase anyone can create an account, but I don't want that, so I made a group of users who are the only ones allowed to read or write the data.
So in order to create an account, the user would have to request a registration code (which in reality is a JWT generated by Firebase that grants permission to add a user to the group I mentioned).
This app is for local use, meaning only people who live here are going to use it. So, back to the original question: the token is too big for someone to type (as I have said many times), and I wanted to know whether and how I could shrink it. Since that didn't work out, I tried a different approach: generate the token (from a separate program), encrypt it with a random code, and upload it to Firebase. I then give that random code to people; they type it into the app, which retrieves and decrypts the token and authenticates with it, so the user ends up with an account that has the privilege to read or write data.
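For anyone curious, here is a rough sketch of that flow (assuming the cryptography package is available; the Firebase upload/download steps are left as comments, and the key-derivation details are just one reasonable choice):
import base64
import hashlib
import secrets
from cryptography.fernet import Fernet

def make_short_code(length=6):
    # Easy-to-type alphabet with no ambiguous characters.
    alphabet = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789'
    return ''.join(secrets.choice(alphabet) for _ in range(length))

def key_from_code(code):
    # Derive a 32-byte Fernet key from the short code.
    raw = hashlib.pbkdf2_hmac('sha256', code.encode(), b'app-salt', 100000)
    return base64.urlsafe_b64encode(raw)

jwt_token = b'...the long Firebase JWT...'
code = make_short_code()
encrypted = Fernet(key_from_code(code)).encrypt(jwt_token)
# Upload `encrypted` to Firebase and hand `code` to the user; the app then
# downloads `encrypted` and runs Fernet(key_from_code(code)).decrypt(encrypted).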
Thanks for your responses and sorry if I wasted your time.
I'm using Python's UUID function to create unique IDs for objects to be stored in a database:
>>> import uuid
>>> print uuid.uuid4()
2eec67d5-450a-48d4-a92f-e387530b1b8b
Is it ok to assume that this is indeed a unique ID?
Or should I double-check that this unique ID has not already been generated against my database before accepting it as valid.
I would use uuid1, which has essentially zero chance of collisions since it takes the date/time into account when generating the UUID (unless you are generating a great number of UUIDs at the same instant).
You can actually reverse the UUID1 value to retrieve the original epoch time that was used to generate it.
uuid4 generates a random ID that has a very small chance of colliding with a previously generated value. However, since it doesn't use monotonically increasing epoch time as an input (or include it in the output UUID), a previously generated value has a (very) small chance of being generated again in the future.
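As a small sketch, you can recover that timestamp from a uuid1; its .time attribute counts 100-nanosecond intervals since the Gregorian epoch (1582-10-15):
import uuid
from datetime import datetime, timedelta

u = uuid.uuid1()
gregorian_epoch = datetime(1582, 10, 15)
# u.time is in 100-ns units; dividing by 10 converts it to microseconds.
created_at = gregorian_epoch + timedelta(microseconds=u.time // 10)
print(created_at)  # roughly the moment the UUID was generated (UTC)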
You should always have a duplicate check; even though the odds are heavily in your favour, duplicates are still possible.
I would recommend just adding a unique key constraint in your database and retrying in case of an error.
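As a sketch of that pattern (sqlite3 here stands in for whatever database you use; the table and function are made up):
import sqlite3
import uuid

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (id TEXT PRIMARY KEY, payload TEXT)')

def insert_item(payload, retries=3):
    # The PRIMARY KEY enforces uniqueness; on the astronomically unlikely
    # clash the database raises IntegrityError and we just try a fresh UUID.
    for _ in range(retries):
        new_id = str(uuid.uuid4())
        try:
            conn.execute('INSERT INTO items (id, payload) VALUES (?, ?)',
                         (new_id, payload))
            return new_id
        except sqlite3.IntegrityError:
            continue
    raise RuntimeError('could not generate a unique id')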
As long as you create all the UUIDs on the same system, then unless there is a very serious flaw in the Python implementation (which I really cannot imagine), RFC 4122 says they will all be distinct (edit: if using version 1, 3, or 5).
The only problem that could arise with UUIDs is if two systems create a UUID at exactly the same moment and:
use the same MAC address on their network card (really uncommon) and you are using UUID version 1,
or use the same name and you are using UUID version 3 or 5,
or get the same random number and you are using UUID version 4 (*)
So if you have a real MAC address, or use an official DNS name, or a unique LDAP DN, you can take it as true that the generated UUIDs will be globally unique.
So IMHO, you only have to check uniqueness if you want to protect your application against a malicious attacker deliberately reusing an existing UUID.
EDIT:
As stated by Martin Konecny, in uuid4 the timestamp part is random too, not monotonic. So the possibility of a collision is very small, but not 0.
I'm trying to generate URLs for my database objects. I've read that I should not use the primary key for URLs, and a slug is not a good option for this particular model. Based on the advice in that link, I played around with zlib.crc32() in a Python interpreter and found that it often returns negative numbers, which I don't want in my URLs. Is there a better hash I should be using to generate my URLs?
UPDATE: I ended up using the bitwise XOR masking method suggested by David below, and it works wonderfully. Thanks to everyone for your input.
First, "don't use primary keys in URLs" is only a very weak guideline. If you are using incremental integer IDs and you don't want to reveal those numbers, then you could obfuscate them a little bit. For example, you could use: masked_id = entity.id ^ 0xABCDEFAB and unmasked_id = masked_id ^ 0xABCDEFAB.
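That masking round trip, as a tiny sketch:
MASK = 0xABCDEFAB

def mask_id(pk):
    return pk ^ MASK       # the value you expose in the URL

def unmask_id(masked):
    return masked ^ MASK   # XOR with the same constant undoes itself

assert unmask_id(mask_id(42)) == 42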
Second, the article you linked to is highly suspicious. I would not trust it. First, CRC32 is a one-way hashing function: it's impossible (in general) to take a CRC32 hash and get back the string used to create that hash. You'll notice that he doesn't show you how to look up a Customer given the CRC32 of their pk. Second, the code in the article doesn't even make sense. The zlib.crc32 function expects a byte string, while Customer.id will be an integer.
Third, be careful if you want to use a slug for a URL: if the slug changes, your URLs will also change. This may be okay, but it's something you'll need to consider.
Is the most efficient way to look up a document by its _id in pymongo something like this:
db.test.find_one(ObjectId('4f3dd96d1453373bcb000000'))
or something else entirely? I know that the _id column is indexed automatically and I'm hoping to capitalize on that efficiency.
Thanks!
Yes, your approach is correct.
Since you're asking about efficiency, remember that when you're optimizing read operations for performance, you may want to read only the attributes that you need. If certain attributes of your documents are large, then this can reduce the IO costs (transferring data from server to client) dramatically. For example, if your document has 20 attributes, but you're only using 5 of them, then don't pull the other 15 over the wire. In pymongo, you can do this using the optional fields parameter of the collection.find function. Obviously you need to balance performance vs code maintainability here, since listing attributes increases maintenance costs.
More optimization suggestions are available in the official docs. Their list includes "Optimization #3: Select only relevant fields" which is just the point that I made above.
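For example, here is a sketch of fetching only a couple of fields by _id (the field names are made up; older pymongo calls this argument fields, newer versions call it projection):
from bson.objectid import ObjectId

doc = db.test.find_one(
    {'_id': ObjectId('4f3dd96d1453373bcb000000')},
    {'title': 1, 'price': 1},   # pull only the attributes you actually use
)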
If you're getting a value specifically by the _id, then I would say yes this is the most efficient approach.
Depending on your data, it may be more efficient to index that value and search on it.
If you know the _id, then you should query that way: db.test.find_one(ObjectId('4f3dd96d1453373bcb000000'))
Your full code in pymongo might look something like this:
from pymongo import Connection
from bson.objectid import ObjectId

# Connect and authenticate (older pymongo API; newer versions use MongoClient).
connection = Connection(self.host)
db = connection[self.db_name]
db.authenticate(self.user_name, self.password)

collection = db[self.question_collection]
obj_id = ObjectId(_id)
info = collection.find_one(obj_id)