Is integer comparison in Python constant time? Can I use it to compare a user-provided int token with a server-stored int for crypto purposes, the way I would compare strings with constant_time_compare from django.utils.crypto, i.e. without being exposed to timing attacks?
Alternatively, is it more secure to convert to a string and then use the above function?
The answer is yes for a given size of integer. By default, Python integers that get big become long and then have potentially unlimited length, so the compare time grows with the size. If you restrict the size of the integer to a ctypes.c_uint64 or ctypes.c_uint32, this will not be the case.
Note that comparison with 0 is a special case and is normally much faster, because many CPUs have a special hardware flag for zero; but if you are using or allowing seeds or tokens with a value of 0, you are asking for trouble.
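A more conservative option than relying on == is to encode both integers as fixed-width byte strings and compare those with hmac.compare_digest from the standard library, which is designed to resist timing analysis. The sketch below assumes tokens fit in an unsigned 64-bit integer; TOKEN_BYTES and int_tokens_equal are illustrative names, not part of any library.

```python
import hmac

TOKEN_BYTES = 8  # assumption: tokens fit in an unsigned 64-bit integer


def int_tokens_equal(user_token: int, stored_token: int) -> bool:
    """Compare two integer tokens via fixed-width byte strings."""
    try:
        # Both values are encoded to exactly TOKEN_BYTES bytes, so the
        # comparison always runs over inputs of the same length.
        a = user_token.to_bytes(TOKEN_BYTES, "big")
        b = stored_token.to_bytes(TOKEN_BYTES, "big")
    except OverflowError:
        # Negative or oversized values cannot be valid tokens.
        return False
    return hmac.compare_digest(a, b)
```

Rejecting out-of-range values early does reveal that the token was malformed, but not anything about the stored value.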
I want to come up with a function that assigns unique values to a string based on its lexicographic order. For instance, if my function is called get_key(s), it should take a string s as input and return a unique integer, so that I can compare two strings in O(1) time by comparing their integers.
Some code for clarity:
get_key('aaa')
#Returns some integer
get_key('b')
#Returns another integer > output of get_key('aaa') since 'b' > 'aaa'
Any help would be highly appreciated.
Note: Cannot use python built in function id()
It's impossible.
Why? No matter what numbers you return for two strings, I can always find a new string that sorts between them, while there are only finitely many integers between any two integers.
You would need an unlimited number of values, because there are infinitely many strings.
If I understand your problem correctly, one idea is to convert the input to hex and then from hex to int. I believe this would solve the problem, but I think it is impossible to do in O(1). The solution I provided (and every solution I can think of) needs O(n), since there is no bound on the input length and the function's running time depends on that length.
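If the strings are known to be bounded in length, an order-preserving key does exist. The sketch below assumes ASCII strings of at most max_len characters with no NUL characters (both restrictions are assumptions added here, not part of the question): it pads each string on the right with zero bytes and reads the result as a big-endian integer. Building the key is still O(n); only the later comparisons are cheap.

```python
def get_key(s: str, max_len: int = 8) -> int:
    """Order-preserving integer key for short ASCII strings (sketch)."""
    data = s.encode("ascii")  # raises for non-ASCII input
    if len(data) > max_len or b"\x00" in data:
        raise ValueError("string too long or contains NUL")
    # Right-pad with zero bytes so that a prefix sorts before its extensions,
    # then interpret the fixed-width bytes as a big-endian integer.
    return int.from_bytes(data.ljust(max_len, b"\x00"), "big")


words = ["b", "aaa", "a", "ab", "aab", "zz"]
# Sorting by key gives the same order as sorting the strings directly.
sorted_by_key = sorted(words, key=get_key)
```

Without the length bound the argument above still applies: the padded width would have to grow without limit.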
For example,
int(x)
float(x)
str(x)
What is the time complexity of each of them?
There is no definite answer to this because it depends not just what type you're converting to, but also what type you're converting from.
Let's consider just numbers and strings. To avoid writing "log" everywhere, we'll measure the size of an int by saying n is how many bits or digits it takes to represent it. (Asymptotically it doesn't matter if you count bits or digits.) For strings, obviously we should let n be the length of the string. There is no meaningful way to measure the "input size" of a float object, since floating-point numbers all take the same amount of space.
Converting an int, float or str to its own type ought to take Θ(1) time because they are immutable objects, so it's not even necessary to make a copy.
Converting an int to a float ought to take Θ(1) time because you only need to read at most a fixed constant number of bits from the int object to find the mantissa, and the bit length to find the exponent.
Converting an int to a str ought to take Θ(n²) time, because you have to do Θ(n) division and remainder operations to find n digits, and each arithmetic operation takes Θ(n) time because of the size of the integers involved.
Converting a str to an int ought to take Θ(n²) time because you need to do Θ(n) multiplications and additions on integers of size Θ(n).
Converting a str to a float ought to take Θ(n) time. The algorithm only needs to read a fixed number of characters from the string to do the conversion, and floating-point arithmetic operations (or operations on bounded int values to avoid intermediate rounding errors) for each character take Θ(1) time; but the algorithm needs to look at the rest of the characters anyway in order to raise a ValueError if the format is wrong.
Converting a float to any type takes Θ(1) time because there are only finitely many distinct float values.
I've said "ought to" because I haven't checked the actual source code; this is based on what the conversion algorithms need to do, and the assumption that the algorithms actually used aren't asymptotically worse than they need to be.
There could be special cases to optimise the str-to-int conversion when the base is a power of 2, like int('11001010', 2) or int('AC5F', 16), since this can be done without arithmetic. If those cases are optimised then they should take Θ(n) time instead of Θ(n²). Likewise, converting an int to a str in a base which is a power of 2 (e.g. using the bin or hex functions) should take Θ(n) time.
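To make the two cases concrete, here is a minimal sketch of both conversions. It is not taken from the interpreter's source: parse_decimal and parse_hex are hypothetical helper names, and the input is assumed to contain only valid digits. The decimal version performs one multiply-and-add on a growing integer per character, while the hexadecimal version maps each pair of digits directly to one byte.

```python
def parse_decimal(digits: str) -> int:
    """Schoolbook str -> int: n multiply-adds on an integer of growing size."""
    value = 0
    for ch in digits:
        # Each step touches the whole accumulated integer, hence Θ(n) work
        # per character and Θ(n²) overall.
        value = value * 10 + (ord(ch) - ord("0"))
    return value


def parse_hex(digits: str) -> int:
    """Power-of-2 base: every two hex digits become one byte, no arithmetic."""
    if len(digits) % 2:
        digits = "0" + digits  # bytes.fromhex needs an even number of digits
    return int.from_bytes(bytes.fromhex(digits), "big")
```

For example, parse_decimal("12345") gives 12345 and parse_hex("AC5F") gives the same value as int("AC5F", 16).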
float(x) is the most complex of these, since floats cover a very wide range of values. The cost also depends on how much of the value you actually use.
I am looking for a library that hashes a string and produces a purely numeric output rather than an alphanumeric one.
eg:
Input string: hello world
Salt value: 5467865390
Output value: 9223372036854775808
I have looked at many libraries, but they all produce alphanumeric output, and I need plain numbers.
Is there any such library? I know that numeric-only output has a higher chance of collisions, but that is fine for my business use case.
EDIT 1:
Also I need to control the number of digits in output. I want to store the value in database which has Numeric datatype. So I need to control the number of digits to fit the size within the data type range
Hexadecimal hash codes can be interpreted as (rather large) numbers:
import hashlib
hex_hash = hashlib.sha1('hello world'.encode('utf-8')).hexdigest()
int_hash = int(hex_hash, 16) # convert hexadecimal to integer
print(hex_hash)
print(int_hash)
outputs
2aae6c35c94fcfb415dbe95f408b9ce91ee846ed
243667368468580896692010249115860146898325751533
EDIT: As asked in the comments, to limit the number to a certain range, you can simply use the modulus operator. Note, of course, that this will increase the possibility of collisions. For instance, we can limit the "hash" to 0 .. 9,999,999 with modulus 10,000,000.
limited_int_hash = int_hash % 10_000_000 # take the modulus of the integer, not the hex string
print(limited_int_hash)
outputs
5751533
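Putting the pieces together with the salt and the digit limit from the question, a minimal sketch might look like the following. The function name numeric_hash and the choice of HMAC-SHA256 with the salt as key are assumptions made for illustration; the question does not say how the salt should be mixed in.

```python
import hashlib
import hmac


def numeric_hash(text: str, salt: int, digits: int = 18) -> int:
    """Return a salted hash of text as an integer with at most `digits` digits."""
    # Assumption: the numeric salt is used as an HMAC key in its decimal form.
    key = str(salt).encode("utf-8")
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    # Interpret the digest as a big integer, then reduce it to the wanted range.
    return int.from_bytes(digest, "big") % 10**digits


value = numeric_hash("hello world", 5467865390, digits=10)
```

The result is stable across runs and machines, and the digits parameter can be matched to the precision of the Numeric column. A smaller digits value increases the chance of collisions.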
I think there is no need for libraries. You can accomplish this with the built-in hash() function in Python.
InputString="Hello World!!"
HashValue=hash(InputString)
print(HashValue)
print(type(HashValue))
Output:
8831022758553168752
<class 'int'>
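Solution for the problem based on the latest EDIT:
The above method is the simplest solution. String hashes are randomized per interpreter process, so the value changes on each invocation, which helps prevent attackers from tampering with our application. It also means the value is not stable between runs unless randomization is switched off.
If you would like to switch off the randomization, you can do that by setting the PYTHONHASHSEED environment variable to 0.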
For information on switching off the randomization check the official docs https://docs.python.org/3.3/using/cmdline.html#cmdoption-R
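The effect of PYTHONHASHSEED can be checked by hashing the same string in separate interpreter processes. This is a small sketch under the assumption that sys.executable points at a usable Python interpreter; hash_in_fresh_interpreter is a hypothetical helper name.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text: str, seed: int) -> int:
    """Run hash(text) in a new Python process with a fixed PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With the same seed, two separate processes agree on the hash value.
first = hash_in_fresh_interpreter("Hello World!!", 0)
second = hash_in_fresh_interpreter("Hello World!!", 0)
```

Even with a fixed seed, the value can differ between Python versions and platforms, so it is a weak basis for values stored in a database.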
I'm trying to evaluate whether comparing two strings gets slower as their length increases. My calculations suggest comparing strings should take amortized constant time, but my Python experiments yield strange results:
Here is a plot of string length (1 to 400) versus time in milliseconds. Automatic garbage collection is disabled, and gc.collect is run between every iteration.
I'm comparing 1 million random strings each time, counting matches as follows. The process is repeated 50 times before taking the min of all measured times.
for index in range(COUNT):
    if v1[index] == v2[index]:
        matches += 1
    else:
        non_matches += 1
What might account for the sudden increase around length 64?
Note: The following snippet can be used to try to reproduce the problem assuming v1 and v2 are two lists of random strings of length n and COUNT is their length.
timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
"from __main__ import COUNT, v1, v2", number=50)
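For anyone trying to reproduce this, the setup can be sketched in a self-contained way as below. The helper names random_strings and time_comparisons, the alphabet, and the default sizes are assumptions; the original data generation is not shown in the question. The globals argument of timeit.Timer replaces the import from __main__.

```python
import random
import string
import timeit


def random_strings(count, length, rng):
    """Build `count` random ASCII-letter strings of the given length."""
    alphabet = string.ascii_letters
    return ["".join(rng.choice(alphabet) for _ in range(length))
            for _ in range(count)]


def time_comparisons(length, count=10_000, repeat=5):
    """Return the best time (seconds) to compare `count` string pairs."""
    rng = random.Random(0)  # fixed seed so runs are repeatable
    v1 = random_strings(count, length, rng)
    v2 = random_strings(count, length, rng)
    timer = timeit.Timer(
        "for i in range(COUNT): v1[i] == v2[i]",
        globals={"COUNT": count, "v1": v1, "v2": v2},
    )
    # Take the minimum over several runs, as in the question.
    return min(timer.repeat(repeat=repeat, number=1))
```

Calling time_comparisons(n) for n from 1 to 400 gives the data for a plot like the one described.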
Further note: I've made two extra tests. Comparing strings with is instead of == suppresses the problem completely, and the performance is about 210ms/1M comparisons.
Since interning has been mentioned, I made sure to add a white space after each string, which should prevent interning; that doesn't change anything. Is it something else than interning then?
Python can 'intern' short strings: it stores them in a special cache and re-uses string objects from that cache.
When comparing strings, it first tests whether both are the same pointer (e.g. an interned string):
if (a == b) {
    switch (op) {
    case Py_EQ:case Py_LE:case Py_GE:
        result = Py_True;
        goto out;
    // ...
Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
However, interning normally only takes place for identifiers (function names, arguments, attributes, etc.), not for string values created at runtime.
Another possible culprit is string constants; string literals used in code are stored as constants at compile time and reused throughout; again only one object is created and identity tests are faster on those.
For string objects that are not the same, Python tests for equal length, equal first characters then uses the memcmp() function on the internal C strings. If your strings are not interned or otherwise are reusing the same objects, all other speed characteristics come down to the memcmp() function.
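The difference between identity and equality can be seen with sys.intern. This is a small sketch; whether two equal strings built at runtime share an object is an implementation detail of CPython, so the identity check before interning is only described as typical.

```python
import sys

a = sys.intern("hello world")
b = "".join(["hello", " ", "world"])  # built at runtime, not interned

same_value = (a == b)           # True: the contents match
same_object_before = (a is b)   # typically False in CPython: distinct objects

c = sys.intern(b)               # returns the already interned object
same_object_after = (a is c)    # True: the pointer check now succeeds
```

When two names refer to the same object, == can return after the pointer test without looking at the contents.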
I am just making wild guesses but you asked "what might" rather than what does so here are some possibilities:
The CPU cache line size is 64 bytes and longer strings cause a cache miss.
Python might store strings of 64 bytes in one kind of structure and longer strings in a more complicated structure.
Related to the last one: it might zero-pad strings into a 64-byte array and is able to use very fast SSE2 vector instructions to match two strings.
I have a hash function in Python.
It returns a value.
How do I see the byte-size of this return value? I want to know if it is 4-bytes or 8 or what.
Reason:
I want to make sure that the min value is 0 and the max value is 2**32 - 1, otherwise my calculations are incorrect.
I want to make sure that packing it to a I struct (unsigned int) is correct.
More specifically, I am calling murmur.string_hash(`x`).
I want to sanity-check that I am getting a 4-byte unsigned return value. If I have a value of a different size, then my calculations get messed up.
If it's an arbitrary function that returns a number, there are only 4 standard types of numbers in Python: small integers (C long, at least 32 bits), long integers ("unlimited" precision), floats (C double), and complex numbers.
If you are referring to the builtin hash, it returns a standard integer (C long):
>>> hash(2**31)
-2147483648
If you want different hashes, check out hashlib.
Generally, thinking of a return value as a particular byte precision in Python is not the best way to go, especially with integers. For most intents and purposes, Python "short" integers are seamlessly integrated with "long" (unlimited) integers. Variables are promoted from the smaller to the larger type as necessary to hold the required value. Functions are not required to return any particular type (the same function could return different data types depending on the input, for example).
When a function is provided by a third-party package (as this one is), you can either just trust the documentation (which for Murmur indicates 4-byte ints as far as I can tell) or test the return value yourself before using it (whether by if, assert, or try, depending on your preference).
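A test of the return value along those lines could be sketched as follows. Because the murmur package is third-party, zlib.crc32 is used here purely as a stand-in that returns an unsigned 32-bit integer in Python 3; check_u32 is a hypothetical helper name.

```python
import struct
import zlib


def check_u32(value):
    """Validate that value fits in 4 unsigned bytes and pack it."""
    if not isinstance(value, int) or not 0 <= value <= 2**32 - 1:
        raise ValueError("not a 4-byte unsigned integer: %r" % (value,))
    # "<I" is a little-endian unsigned int with a standard size of 4 bytes.
    return struct.pack("<I", value)


h = zlib.crc32(b"x")   # stand-in for murmur.string_hash(...)
packed = check_u32(h)  # 4 bytes, or ValueError if the value is out of range
```

struct.pack itself raises struct.error for out-of-range values, so the explicit range check mainly gives a clearer error message.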