Why compare two strings via calculating xor of their characters?

Some time ago I found this function (unfortunately, I don't remember where it came from; most likely some Python framework) that compares two strings and returns a bool value. It's quite simple to understand what's going on here.
The XOR of two characters' code points is 0 if they match and non-zero if they do not.
def cmp_strings(str1, str2):
    return len(str1) == len(str2) and sum(ord(x)^ord(y) for x, y in zip(str1, str2)) == 0
But why is this function used? Isn't it the same as str1==str2?

It takes a similar amount of time to compare any strings that have the same length. It's used for security when the strings are sensitive. Usually it's used to compare password hashes.
If == is used, Python stops comparing characters as soon as it finds the first mismatch. This is bad for hashes because it could reveal how close a hash was to matching, which would help an attacker brute-force a password.
This is how hmac.compare_digest works.
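As a minimal sketch of the standard-library route, assuming both digests are hex strings computed elsewhere (the sample passwords and variable names below are made up for illustration):

```python
import hashlib
import hmac

# Hypothetical values for illustration only.
stored_digest = hashlib.sha256(b"correct horse").hexdigest()
guess_digest = hashlib.sha256(b"wrong guess").hexdigest()

# compare_digest is designed so that its running time does not reveal
# where the first mismatch occurs.
print(hmac.compare_digest(stored_digest, guess_digest))   # False
print(hmac.compare_digest(stored_digest, stored_digest))  # True
```

hmac.compare_digest accepts either two ASCII-only str values or two bytes-like objects.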

The security issue that is being addressed by XOR comparison is known as a Timing Attack. ...This is where you observe how much time it takes the Compare function to succeed|fail, and use that knowledge to gain an advantage over the system.
There are 95 printable ASCII characters. If you have an 8 character password, there are 95^8 (6,634,204,312,890,625) possible combinations ...If the correct password is the last one in your list, and you can try 1 billion passwords per second, it will take you about 77 days to Brute Force the password ...That's too long - so we need a shortcut!
There are an infinite number of ways to store a string - and probably a dozen in popular use {length-prefixed, NUL-terminated, ...}{Unicode, UTF-8, ASCII, ...}. For this working example, I will use the ubiquitous 'NUL-terminated array of bytes using ASCII encoding' ...i.e. "ABC" will be stored as "ABC"NUL, or {65, 66, 67, 0} ...but whatever storage/encoding standard you use, the problem is essentially the same.
Syntactically, there are as many ways to compare two strings as there are languages, e.g. if str1 == str2 or if (strcmp(str1, str2) == 0) etc. ...but when you look at how they work internally, they are all pretty much the same. Here is some simple (but realistic) pseudo-code to perform a classic (non-security) string compare:
index = 0
LOOP FOREVER {
    IF ( (str1[index] == 0) AND (str2[index] == 0) ) THEN return 'same'
    IF (str1[index] != str2[index]) THEN return 'different'
    index = index + 1
}
Assuming the secret password is "BY3"NUL ...Let's try some passwords, and notice how many operations the Compare function has to do to establish success|fail.
1. "A"NUL ... returns 'different' when 1st char is checked (A) [zero chars are correct]
2. "B"NUL ... returns 'different' when 2nd char is checked (NUL) [first char must be correct]
3. "BX"NUL ... returns 'different' when 2nd char is checked (X) [first char must be correct]
4. "BY"NUL ... returns 'different' when 3rd char is checked (NUL) [first two chars must be correct]
5. "BY1"NUL ... returns 'different' when 3rd char is checked (1) [first two chars must be correct]
6. "BY2"NUL ... returns 'different' when 3rd char is checked (2) [first two chars must be correct]
7. "BY3"NUL ... returns 'same' when the 4th character is checked (NUL) [all three chars are correct]
You can see that guess 1 fails the 1st time around the loop, guesses 2 & 3 fail the 2nd time around the loop ...guesses 4, 5, 6 fail the 3rd time around the loop ...and guess 7 succeeds the 4th time around the loop.
By observing how much time it takes the Compare function to fail, we can tell which character is wrong! This means we can actually guess the password one character at a time.
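The walkthrough above can be reproduced with a small Python sketch. This is only an illustration of the pseudo-code, assuming NUL-terminated strings; the function name early_exit_compare and the returned pass counter are additions made here so the leak is visible:

```python
def early_exit_compare(secret, guess):
    """Classic compare on NUL-terminated strings.

    Returns (is_same, passes), where passes is the number of times the
    loop body ran before a decision was reached.
    """
    s1 = secret + "\0"  # emulate NUL termination
    s2 = guess + "\0"
    index = 0
    while True:
        passes = index + 1
        if s1[index] == "\0" and s2[index] == "\0":
            return True, passes
        if s1[index] != s2[index]:
            return False, passes
        index += 1


for guess in ["A", "B", "BX", "BY", "BY1", "BY2", "BY3"]:
    print(guess, early_exit_compare("BY3", guess))
```

The pass counts come out as 1, 2, 2, 3, 3, 3 and 4, matching guesses 1 to 7 in the list. An attacker who can measure something proportional to that count learns how many leading characters are right.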
Again, let's assume an 8 character password made up of the 95 printable characters, and our last guess will be correct ...Because we can now guess the password one character at a time, it will take 95*8 (760) guesses. At 1 billion guesses per second, it will take about 0.76 microseconds to find the password [it takes about 100 ms to blink] ...which is a significant advantage over 77 days ...For a laugh, work out the advantage for a 20 character password (95^20 vs 95 * 20).
So how do we stop an attacker from using a Timing Attack? [Spoiler: XOR]
The first thing we need to do is to make both strings the same length; and secondly, we must ALWAYS check EVERY character before returning 'same' or 'different' ...This is surprisingly difficult to do without introducing a new Timing Attack. But rather than show you lots of ways to get it wrong, let's see a way to do it right.
Passwords should (where possible) be stored as Hashes ...{DES, MD5, SHA-1, ...} have now been shown to have cryptographic flaws, {SHA-256, SHA-3, Whirlpool, ...} are still in good favour [Oct 2021] ...You may know that ALL Hashes (generated by a given algorithm) are the same length ...So if we Hash the guess and compare the Guess-Hash against the Stored-Hash, we have solved the first problem - the 'strings' (array of bytes) we need to compare are now ALWAYS the same length.
Secondly. How to make sure our Compare function ALWAYS takes the same amount of time to reach its decision ...There are probably a lot of ways to do this, but the most common solution is to use XOR like this:
result = 0
index = 0
LOOP WHILE (index < hashLength) {
    result = result OR ( secretHash[index] XOR guessHash[index] )
    index = index + 1
}
IF result == 0 THEN return 'same' ELSE return 'different'
And this way ALL calls to the compare function take the same length of time to run ...No more Timing Attack!
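A direct Python translation of that pseudo-code might look like the sketch below. The function name xor_compare is made up here, and the inputs are assumed to be bytes of equal length (such as two digests). Pure Python gives no hard timing guarantees, so in real code hmac.compare_digest is the safer choice:

```python
def xor_compare(secret_hash: bytes, guess_hash: bytes) -> bool:
    """Compare two equal-length byte strings without an early exit."""
    if len(secret_hash) != len(guess_hash):
        # Digests from the same algorithm always have the same length,
        # so this branch leaks nothing about the secret's content.
        return False
    result = 0
    for a, b in zip(secret_hash, guess_hash):
        # XOR is non-zero when the bytes differ; OR keeps any non-zero bit.
        result |= a ^ b
    return result == 0


print(xor_compare(b"\x01\x02\x03", b"\x01\x02\x03"))  # True
print(xor_compare(b"\x01\x02\x03", b"\x01\x02\x04"))  # False
```

Every byte pair is visited whether or not an earlier pair already differed, which is the property the pseudo-code relies on.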
Footnote:
For readers not familiar with Boolean Logic - go and read up; but the essence here is:
If A and B are the same, (A XOR B) gives a result of 0
If A and B are different, (A XOR B) gives a non-0 result
If A and B are both 0, (A OR B) gives a result of 0
If either A or B are non-0, (A OR B) gives a non-0 result
So (looking at the second code block) the first time the XOR returns non-0 (different), the result becomes non-0 (different) and can never return to 0 (same).
A search for "cve timing attack" will provide you with a list of real-life examples.

It appears to be doing a correlation (XOR sum) character-wise between the strings, given they are of the same length. It could be required in situations where you need to know 'similarity' and not equality. Maybe that was the plan. The author might have wanted to extend this function further.

Related

Declaring and Looping over a variable in one line

Just for fun, I am trying to compress a programming problem into one line. I know this is typically a bad practice, but it is a fun challenge that I am asking for your help on.
I have a piece of code whose first line declares the variables and whose second line loops, using the string built in the first line, until a number is no longer found. Finally it returns that value.
The programming question is as follows. Given a sentence, convert each character to its ASCII value. Then convert that ASCII value to binary (left-padding with 0s if the binary number has fewer than 8 digits), and combine the numbers into one string. Starting from the number 0, convert it to binary and check if it is in the string. If it is, add one to the number and check again. Return the last consecutive binary number that is in the string.
Ex)
string = "0000010"
0 in string: add 1
1 in string: add 1
10 in string: add 1
11 not in string: the last consecutive binary number was 10 (base 2) = 2 (base 10). Return 2
You can see my code below
def findLastBinary(s: str):
    string, n = ''.join(['0'*(10-len(bin(ord(char))))+bin(ord(char))[2:] for char in s]), 0
    while bin(n)[2:] in string: n+=1
    return n-1
It would also be nice if I could combine the return statement and loop into one line as well.
EDIT
Fixed the code (it should work now). Also below, you will see a sample test case. Hope this helps with answering this question.
Sample test case
Input:
s="Roses and thorns"
Below you will see the steps my code follows to get the correct answer (obviously made more readable)
Organized into columns in the following order:
Character-Ascii-Binary Representation of ascii value:
R - 82 - 01010010
o - 111 - 01101111
s - 115 - 01110011
etc.
Keep in mind that if the binary number has less than 8 digits, zeros should be added to the beginning of the number until it is 8 digits.
Each binary integer is then concatenated into a single string (I added spaces for readability only):
01010010 01101111 01110011 01100101 01110011 00100000 01100001 01101110 01100100 00100000 01110100 01101000 01101111 01110010 01101110 01110011
Now we start from the binary number 0, and check if it is in the string. It is, so we move on to 1. 1 is in the string, so we move on to 10. 10 is in the string. And so we continue until we find that the binary string 11111 is not in our string. 11111 (base 2) = 31 (base 10). Since 31 was the first number whose binary representation was not in the string, we return the last number whose binary representation was in the string: namely, 31-1=30. 30 is what the function should return.

The problem statement has changed. See the bottom of this answer for the updated solution.
The function can be defined this way, thanks to @treuss' observation (this applies to the original problem to find the largest base 10 integer which when converted to binary is in the string):
def largest_binary_number(sentence: str):
    return int(''.join([bin(ord(char))[2:].zfill(8) for char in sentence]), 2)
But suppose that the problem was to "find the smallest base 10 integer larger than 1000 whose binary representation is in the string." Then we have something like this:
def find(sentence: str):
    return list(iter(lambda: globals().__setitem__('_c', globals().get('_c', 1000-1) + 1) or bin(globals().get('_c'))[2:] in ''.join([bin(ord(c))[2:].zfill(8) for c in sentence]), True)) is type or globals().get('_c')
Let's break this down into four parts:
globals().__setitem__('_c', globals().get('_c', 1000-1) + 1) - initialize and increment a counter
... or bin(globals().get('_c'))[2:] in ''.join([bin(ord(c))[2:].zfill(8) for c in sentence]) - check if the binary representation of the counter is in the binary representation of the sentence
list(iter(lambda: ..., True)) - inline while loop using black magic
... is type or globals().get('_c') - get the final value of the counter, which satisfies our condition
Part 1: globals().__setitem__('_c', globals().get('_c', 1000-1) + 1)
Since we are confined to do everything in one line, we don't have the luxury of defining variables. This is where globals comes in: we can store and use arbitrary variables as dictionary entries using the __setitem__ and get methods. Here we name our counter variable _c, calling get to initialize and fetch the value, then immediately increment it by one and save the value with __setitem__. Now we have a counter variable.
Part 2: ... or bin(globals().get('_c'))[2:] in ''.join([bin(ord(c))[2:].zfill(8) for c in sentence])
bin(globals().get('_c'))[2:] converts the counter to binary and removes the 0b prefix. ''.join([bin(ord(c))[2:].zfill(8) for c in sentence]), as before, converts the input sentence to binary. We use in to check if the binary counter is a substring of the binary sentence. Because the __setitem__ call from part 1 returns None, we use or here to ignore that and execute this part.
Part 3: list(iter(lambda: ..., True))
This is the bread and butter, allowing us to perform inline iteration. iter is usually passed an iterable to create an iterator, but it actually has a second form that takes two arguments: a callable and a sentinel. When iterating over an iterator created using this two-argument form, the callable is called repeatedly until it returns the sentinel value (beware infinite loops!). So we define a lambda function that returns True when the condition is satisfied, and set the sentinel to True. Finally we use the list constructor to begin iterating.
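To see the two-argument form of iter in isolation, here is a small sketch that is independent of the one-liner. The counter from itertools.count is just a stand-in for any callable with changing output:

```python
import itertools

counter = itertools.count(1)  # yields 1, 2, 3, ...

# Call next(counter) repeatedly; stop as soon as it returns the sentinel 4.
values = list(iter(lambda: next(counter), 4))
print(values)  # [1, 2, 3]
```

The sentinel itself is never included in the output; reaching it simply ends the iteration.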
Part 4: ... is type or globals().get('_c')
Once the list constructor finishes iterating, we need to fetch and return the final value of the counter. We follow list(...) with is type to make an expression that always evaluates to False, then chain it with or globals().get('_c') at the end of this one-liner to return the counter. Et voilà!
Part 5:
Of course, what we had before was a two-liner.
find = lambda sentence: list(iter(lambda: globals().__setitem__('_c', globals().get('_c', 1000-1) + 1) or bin(globals().get('_c'))[2:] in ''.join([bin(ord(c))[2:].zfill(8) for c in sentence]), True)) is type or globals().get('_c')
Now we have a one-liner.
Note: In hindsight, maybe the walrus := could be used to make the counter, instead of having to call globals() every time. However, replacing globals with locals doesn't work for some reason.
Note 2: Using these techniques, we can make one-liners that satisfy various conditions.
Update: Here's another version using the walrus
find = lambda sentence: (_c := {'v': 1000-1}) and list(iter(lambda: _c.__setitem__('v', _c['v'] + 1) or bin(_c['v'])[2:] in ''.join([bin(ord(c))[2:].zfill(8) for c in sentence]), True)) is type or _c['v']
We initialize the counter at the top level and simply use _c everywhere else. Note how it is a dict instead of an int because outer variables cannot be assigned within the inner lambda (but mutating outer variables is fine).
Update 2: OP has updated the problem statement, so here's the new solution:
find = lambda s: (_c := {'v': 0-1}) and list(iter(lambda: _c.__setitem__('v', _c['v'] + 1) or bin(_c['v'])[2:] in ''.join([bin(ord(c))[2:].zfill(8) for c in s]), False)) is type or _c['v'] - 1
The techniques are the same, but now we start the counter from -1 (the first iteration increments it to 0 before anything else), the sentinel becomes False (because we stop the loop when the binary counter is not in the binary string), and decrement the return value by 1 to get the last number satisfying the condition.
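For comparison, a plain multi-line version of the same logic could look like the sketch below. The helper names to_bits and last_consecutive_binary are made up here; this is the readable counterpart, not the one-liner itself:

```python
def to_bits(sentence: str) -> str:
    """Concatenate the 8-bit binary form of each character."""
    return "".join(format(ord(ch), "08b") for ch in sentence)


def last_consecutive_binary(bits: str) -> int:
    """Largest n such that the binary forms of 0..n all occur in bits."""
    n = 0
    while format(n, "b") in bits:
        n += 1
    return n - 1  # -1 if even "0" is missing


print(to_bits("R"))                        # 01010010
print(last_consecutive_binary("0000010"))  # 2
```

Calling last_consecutive_binary(to_bits(s)) mirrors what the one-liner computes for a sentence s.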

How to increment a numeric string in Python

I've spent the last two days trying to figure out how to increment a numeric string in Python. I am trying to increment a sequence number when a record is created. I spent all day yesterday trying to do this as an integer, and it works fine, but I could never get the database to store leading zeros. I did extensive research on this topic on StackOverflow, and while there are several examples of how to do this as an integer and store leading zeros, none of them worked for me. Many of the examples were from 2014, so perhaps the methodology has changed. I then switched over to a string and changed my attribute to a CharField, and I can get the function to work with leading zeros, but now I can't seem to get it to increment. I'm sure it's something simple I'm not doing, but I'm out of ideas. Thanks in advance for your help. Here is the function; it runs, but every time I call it, it just returns 00000001 instead of incrementing.
def getNextSeqNo(self):
    x = str(int(self.request_number) + 1)
    self.request_number = str(x).zfill(8)
    return self.request_number
Here is the field as it's defined:
request_number = models.CharField(editable=True,null=True,max_length=254,default="00000")
I added a default of "00000" as the system is giving me the following error if it is not present:
int() argument must be a string, a bytes-like object or a number, not 'NoneType'
I realize the code I have is basically incrementing my default by 1, which is why I'm always getting 00000001 as my sequence number. Can't seem to figure out how to get the current number and then increment by 1. Any help is appreciated.

Some time ago I made something similar.
You have to convert your string to an int, increment it, then get the length of the result and calculate the number of zeros that you need:
code = "00000"
code = str(int(code) + 1)
code_length = len(code)
if code_length < 5:  # five is the maximum length of your code
    code = "0" * (5 - code_length) + code
print(code)

Can this be done? Yes. But don't do it.
Make it an integer.
Incrementing is then trivial - automatic if you make this the primary key. For searching, you convert the string to an integer and search the integer - that way you don't have to worry how many leading zeros were actually included as they will all be ignored. Otherwise you will have a problem if you use 6 digits and the user can't remember and puts in 6 0's + the number and then doesn't get a match.

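For those who just want to increment a number inside a string:
import re
a1 = 'name 1'
# re.search finds the first run of digits in the string
num0_m = re.search(r'\d+', str(a1))
if num0_m:
    # match that number only where it is not part of a longer digit run
    rx = r'(?<!\d){}(?!\d)'.format(num0_m.group())
    print(re.sub(rx, lambda x: str(int(x.group()) + 1), a1))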
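The snippet above works on the first run of digits that re.search finds. A sketch that targets the last run instead, assuming a single-line string (the function name increment_last_number is made up for this example):

```python
import re


def increment_last_number(text: str) -> str:
    """Add 1 to the last run of digits in a single-line string."""
    # \d+ that is not followed by any further digit later in the string.
    return re.sub(
        r"\d+(?!.*\d)",
        lambda m: str(int(m.group()) + 1),
        text,
        count=1,
    )


print(increment_last_number("name 1"))      # name 2
print(increment_last_number("v2 build 9"))  # v2 build 10
```

Leading zeros are not preserved by this sketch ("007" becomes "8"); pad with zfill afterwards if the width matters.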

number = int('00000150')
('0'*7 + str(number+1))[-8:]
This takes any number, adds 1, concatenates/joins it to a string of several (at least 7 in your case) zeros, and then slices to return the last 8 characters.
IMHO simpler and more elegant than measuring length and working out how many zeros to add.

How do I represent a string as a number?

I need to represent a string as a number, however it is 8928313 characters long, note this string can contain more than just alphabet letters, and I have to be able to convert it back efficiently too. My current (too slow) code looks like this:
alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!#()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
alphaLeng = len(alpha)
def letterNumber(letters):
    letters = str(letters)
    cof = 1
    nr = 0
    for i in range(len(letters)):
        nr += cof*alpha.find(letters[i])
        cof *= alphaLeng
        print(i,' ',len(letters))
    return str(nr)

Ok, since other people are giving awful answers, I'm going to step in.
You shouldn't do this.
You shouldn't do this.
An integer and an array of characters are ultimately the same thing: bytes. You can access the values in the same way.
Most number representations cap out at 8 bytes (64-bits). You're looking at 8 MB, or 1 million times the largest integer representation. You shouldn't do this. Really.
You shouldn't do this. Your number will just be a custom, gigantic number type that would be identical under the hood.
If you really want to do this, despite all the reasons above, here's how...
Code
def lshift(a, b):
    # shift a left by b bytes (8 * b bits)
    return a << (8 * b)

def string_to_int(data):
    # in Python 3, data must be bytes-like, e.g. b"abc" or text.encode()
    sum_ = 0
    r = range(len(data) - 1, -1, -1)
    for a, b in zip(bytearray(data), r):
        sum_ += lshift(a, b)
    return sum_
DONT DO THIS
Explanation
Characters are essentially bytes: they can be encoded in different ways, but ultimately you can treat them within a given encoding as a sequence of bytes. In order to convert them to a number, we can shift each one left by 8 bits per position in the sequence, creating a unique number. r, the range value, is the position in reverse order: the 4th element from the end needs to go left 24 bits (3*8), etc.
After getting the range and converting our data to 8-bit integers, we can then transform the data and take the sum, giving us our unique identifier. It will be identical byte-wise (or in reverse byte-order) of the original number, but just "as a number". This is entirely futile. Don't do it.
Performance
Any performance is going to be outweighed by the fact that you're creating an identical object for no valid reason, but this solution is decently performant.
1,000 elements takes ~486 microseconds, 10,000 elements takes ~20.5 ms, while 100,000 elements takes about 1.5 seconds. It would work, but you shouldn't do it. This means it's scaled as O(n**2), which is likely due to memory overhead of reallocating the data each time the integer size gets larger. This might take ~4 hours to process all 8e6 elements (14365 seconds, calculated fitting the lower-order data to ax**2+bx+c). Remember, this is all to get the identical byte representation as the original data.
Futility
Remember, there are ~1e78 to 1e82 atoms in the entire universe, on current estimates. This is ~2^275. Your value will be able to represent 2^71426504, or about 260,000 times as many bits as you need to represent every atom in the universe. You don't need such a number. You never will.
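If the conversion is needed anyway, the built-in int.from_bytes performs the same big-endian packing in a single call and avoids the repeated additions. A minimal sketch, assuming UTF-8 text; the helper names text_to_int and int_to_text are made up here:

```python
def text_to_int(text: str) -> int:
    """Pack the UTF-8 bytes of text into one big-endian integer."""
    return int.from_bytes(text.encode("utf-8"), "big")


def int_to_text(number: int) -> str:
    """Reverse text_to_int (leading NUL characters cannot be recovered)."""
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, "big").decode("utf-8")


n = text_to_int("ABC")
print(n)               # 4276803
print(int_to_text(n))  # ABC
```

The round trip loses any leading "\0" characters, because leading zero bytes do not change the integer's value.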

If there are only ANSI characters, you can use the built-in functions ord() and chr().

There are several optimizations you can perform. For example, the find method requires searching through your string for the corresponding letter. A dictionary would be faster. Even faster might be (benchmark!) the chr function (if you're not too picky about the letter ordering) and the ord function to reverse the chr. But if you're not picky about ordering, it might be better if you just left-NULL-padded your string and treated it as a big binary number in memory if you don't need to display the value in any particular format.
You might get some speedup by iterating over characters instead of character indices. If you're using Python 2, a large range will be slow since a list needs to be generated (use xrange instead for Python 2); Python 3's range is lazy, so it's better.
Your print function is going to slow down output a fair bit, especially if you're outputting to a tty.
A big number library may also buy you speed-up: Handling big numbers in code

Your alpha.find() function needs to iterate through alpha on each loop.
You can probably speed things up by using a dict, as dictionary lookups are O(1):
alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!#()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
alpha_dict = { letter: index for index, letter in enumerate(alpha)}
print(alpha.find('$'))
# 83
print(alpha_dict['$'])
# 83

Store your strings in an array of distinct values; i.e. a string table. In your dataset, use a reference number. A reference number of n corresponds to the nth element of the string table array.

Project Euler #13 Python. Incorrect carry over

This problem asks to sum up 100 numbers, each 50 digits long. http://code.jasonbhill.com/python/project-euler-problem-13/
We can replace \n with "\n+" in Notepad++ yielding
a=37107287533902102798797998220837590246510135740250
+46376937677490009712648124896970078050417018260538
...
+20849603980134001723930671666823555245252804609722
+53503534226472524250874054075591789781264330331690
print(a)
>>37107287533902102798797998220837590246510135740250 (incorrect)
We can as well replace \n with \na+= yielding
a=37107287533902102798797998220837590246510135740250
a+=46376937677490009712648124896970078050417018260538
...
a+=20849603980134001723930671666823555245252804609722
a+=53503534226472524250874054075591789781264330331690
print(a)
>>553... (correct)
This seems to be a feature of BigInteger arithmetic. Under which conditions does summing all the numbers at once (Method 1) yield a different result from incrementing iteratively (Method 2)?

As you can see in the result, the first set of instructions does not compute the sum; it only keeps the first assignment. Since +N is a valid statement on its own, the lines after the assignment do nothing. Thus
a=42
+1
print(a)
prints 42.
To write a statement over two lines, you need to escape the ending newline \n with a backslash:
a=42\
+1
print(a)
prints 43.

Python source code lines are terminated by newline characters. The subsequent lines in the first example are separate expression statements consisting of a single integer with a unary plus operator in front, but they don't do anything. They evaluate the expression (resulting in the integer constant itself), and then ignore the result. If you put all numbers on a single line, or use parentheses around the addition, the simple sum will work as well.
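Both fixes can be sketched as follows. The short numbers are hypothetical stand-ins for the 100 fifty-digit values, and the variable names are made up for this example:

```python
# Parentheses let one expression span several lines.
a = (42
     + 1
     + 7)
print(a)  # 50

# Alternatively, keep the numbers as text and let Python add them up.
numbers_text = """42
1
7"""
total = sum(int(line) for line in numbers_text.splitlines())
print(total)  # 50
```

The second form avoids editing the input at all: the block of digits is pasted as-is into a string and parsed line by line.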

Strings, ints and leading zeros

I need to record SerialNumber(s) on an object. We enter many objects. Most serial numbers are strings - the numbers aren't used numerically, just as unique identifiers - but they are often sequential. Further, leading zeros are important due to unique id status of serial number.
When doing data entry, it's nice to just enter the first "sequential" serial number (e.g. 000123) and then the number of items (e.g. 5) to get the desired output. That way we can enter data in bulk, as shown below:
Obj1.serial = 000123
Obj2.serial = 000124
Obj3.serial = 000125
Obj4.serial = 000126
Obj5.serial = 000127
The problem is that when you take the first number-as-string, turn it into an integer and increment, you lose the leading zeros.
Not all serials are sequential - not all are even numbers (eg FDM-434\RRTASDVI908)
But those that are, I would like to automate entry.
In python, what is the most elegant way to check for leading zeros (*and, I guess, edge cases like 0009999) in a string before iterating, and then re-application of those zeros after increment?
I have a solution to this problem but it isn't elegant. In fact, it's the most boring and blunt alg possible.
Is there an elegant solution to this problem?
EDIT
To clarify the question, I want the serial to have the same number of digits after the increment.
So, in most cases, this will mean reapplying the same number of leading zeros. BUT in some edge cases the number of leading zeros will be decremented. eg: 009 -> 010; 0099 -> 0100

Try str.zfill():
>>> s = "000123"
>>> i = int(s)
>>> i
123
>>> n = 6
>>> str(i).zfill(n)
'000123'

I'll expand on my comment here, with Obj1.serial being a string:
Obj1.serial = "000123"
('%0'+str(len(Obj1.serial))+'d') % (1+int(Obj1.serial))
It's like @owen-s's answer '%06d' % n: print the number and pad with leading 0s.
Regarding '%d' % n, it's just one way of printing. From PEP3101:
In Python 3.0, the % operator is supplemented by a more powerful
string formatting method, format(). Support for the str.format()
method has been backported to Python 2.6.
So you may want to use format instead… Anyway, you have an integer at the right of the % sign, and it will replace the %d inside the left string.
'%06d' means: print an integer (d) at least 6 characters wide (6), padded with zeros (0) if necessary.
As Obj1.serial is a string, you have to convert it to an integer before the increment: 1+int(Obj1.serial). And because the right side takes an integer, we can leave it like that.
Now, for the left part, as we can't hard code 6, we have to take the length of Obj1.serial. But this is an integer, so we have to convert it back to a string, and concatenate to the rest of the expression %0 6 d : '%0'+str(len(Obj1.serial))+'d'. Thus
('%0'+str(len(Obj1.serial))+'d') % (1+int(Obj1.serial))
Now, with format (format-specification):
'{0:06}'.format(n)
is replaced in the same way by
('{0:0'+str(len(Obj1.serial))+'}').format(1+int(Obj1.serial))

You could check the length of the string ahead of time, then use rjust to pad to the same length afterwards:
>>> s = "000123"
>>> len_s = len(s)
>>> i = int(s)
>>> i
123
>>> str(i).rjust(len_s, "0")
'000123'
You can check a serial number for all digits using:
if serial.isdigit():
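Putting the pieces together for the bulk-entry case in the question, one possible sketch is shown below. The function name serial_range is made up here, and raising ValueError for non-numeric serials is an assumption about the desired behaviour:

```python
def serial_range(first: str, count: int) -> list:
    """Return count consecutive serials starting at first, keeping its width."""
    if not (first.isascii() and first.isdigit()):
        raise ValueError("serial is not purely numeric: %r" % first)
    width = len(first)  # remember the original number of digits
    start = int(first)
    # zfill pads back to the original width; longer results are left as-is.
    return [str(start + i).zfill(width) for i in range(count)]


print(serial_range("000123", 5))
# ['000123', '000124', '000125', '000126', '000127']
print(serial_range("0099", 2))
# ['0099', '0100']
```

Edge cases such as 009 -> 010 fall out of zfill automatically, and a serial like FDM-434\RRTASDVI908 is rejected rather than silently mangled.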
