Conversion of float to int in Python (Hill Cipher) - python

I am getting different results for conversion of float to int in my Hill Cipher code (during decryption).
Code: https://github.com/krshrimali/Hill-Cipher/blob/master/hill_cipher.py
Issue: https://github.com/krshrimali/Hill-Cipher/issues/1
Code:
# create an empty plain text string
plain_text = ""
# result is a matrix [[260. 574. 539.]]
# addition of 65 because inputs are uppercase letters
for i in range(dimensions):
    plain_text += chr(int(result[0][i]) % 26 + 65)
Output: ABS
(the cipher text - encrypted text - was POH)
Result matrix (after multiplying the inverse key matrix with the cipher text):
[[ 260. 574. 539.]]
After conversion to int:
[260, 573, 538]
Can anyone explain why this happens and suggest a fix? Thanks.

The problem is that you're using int, which truncates toward zero.
Math with float values is inherently imprecise. If you don't understand why, the classic explanation is in What Every Computer Scientist Should Know About Floating-Point Arithmetic. But the short version is that every conversion and every intermediate calculation gets rounded to the nearest value representable with a 52-bit fraction. And that may mean that a calculation that would yield exactly 574 if performed with real numbers actually yields a number a tiny bit more or less than 574 when performed with floats. And if you end up with a number a tiny bit less than 574 and truncate it toward zero with int, you get 573.
In this case, what you want to do is use round instead, which rounds to the nearest integer. As long as you can be sure that your accumulated error is never as large as 0.5, that will do what you want. And, as long as you don't pick ridiculously huge key values (which would be pointless, because you don't get any more security that way), you can be sure of that.
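For example, a minimal sketch of that fix against the question's loop (result and dimensions as defined there):
# round to the nearest integer before the modulo, so a value like
# 573.9999999 becomes 574 instead of truncating to 573
plain_text = ""
for i in range(dimensions):
    plain_text += chr(int(round(result[0][i])) % 26 + 65)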
However, there are two things worth considering here.
From a brief scan of the Hill cipher article on Wikipedia: it is designed to be performed with quick pencil-and-paper operations. First, you don't need the true inverse matrix, just a matrix that is its inverse mod 26, which is easier to calculate and keeps you in smaller numbers that are less likely to hit this problem. Second, that means you can do all the math in integers, so the problem doesn't arise in the first place: create your matrix as an array with dtype=int, and there will be no rounding issues. And, as a bonus, if you do pick ridiculously huge key values (which would be pointless, because you don't get any more security that way), you'll get an error instead of incorrect results. (If you want to allow such values, you'd want to store Python unlimited-size int values in a dtype=object array. But if you don't need that, it just makes things slower and more complicated.)
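For illustration, a minimal sketch of the mod-26 inverse (not the code from the linked repo; it assumes the key's determinant is coprime with 26, and pow(x, -1, 26) needs Python 3.8+):
import numpy as np

def inverse_mod26(key):
    # true determinant, rounded back to an exact integer
    det = int(round(np.linalg.det(key)))
    # modular inverse of the determinant; raises ValueError if
    # gcd(det, 26) != 1, i.e. the key is not invertible mod 26
    det_inv = pow(det % 26, -1, 26)
    # adjugate = det * inverse(key), which is an integer matrix
    adj = np.round(det * np.linalg.inv(key)).astype(int)
    return (det_inv * adj) % 26

key = np.array([[6, 24, 1], [13, 16, 10], [20, 17, 15]])
assert ((inverse_mod26(key) @ key) % 26 == np.eye(3)).all()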

Related

Reconstructing two (string concatenated) numbers that were originally floats

Unfortunately, the printing instruction in the code was written without an end-of-line character, and as a result every 26th number consists of two numbers joined together. The following code shows an example of such behaviour; at the end there is a fragment of the original database.
import numpy as np

for _ in range(2):
    A = np.random.rand() + np.random.randint(0, 100)
    B = np.random.rand() + np.random.randint(0, 100)
    C = np.random.rand() + np.random.randint(0, 100)
    D = np.random.rand() + np.random.randint(0, 100)
    with open('file.txt', 'a') as f:
        f.write(f'{A},{B},{C},{D}')
And thus the output example file looks very similar to what follows:
40.63358599010553,53.86722741700399,21.800795158561158,13.95828176311762557.217562728494684,2.626308403991772,4.840593988487278,32.401778122213486
The issue is that two numbers are 'printed together'; in the example they were as follows:
13.95828176311762557.217562728494684
So you cannot know if they should be
13.958281763117625, 57.217562728494684
or
13.9582817631176255, 7.217562728494684
Please understand that in this case there are only two options, but the problem I want to address concerns 'unbounded' numbers of Python's float type (where 'unbounded' means in a range we don't know, e.g. within ±1E4).
Can the original numbers be reconstructed based on "some" python internal behavior I'm missing?
Actual data with periodicity 27 (i.e. the 26th number consists of 2 joined together):
0.9221878978925224, 0.9331311610066017,0.8600582424784715,0.8754578588852764,0.8738648974725404, 0.8897837559800233,0.6773502027673041,0.736325377603136,0.7956454122424133, 0.8083168444596229,0.7089031184165164, 0.7475306242508357,0.9702361286847581, 0.9900689384633811,0.7453878225174624, 0.7749000030576826,0.7743879170108678, 0.8032590543649807,0.002434,0.003673,0.004194,0.327903,11.357262,13.782266,20.14374,31.828905,33.9260060.9215201173775437, 0.9349343132442707,0.8605282244327555,0.8741626682026793,0.8742163597524663, 0.8874673376386358,0.7109322043854609,0.7376362393985332,0.796158275345
To expand my comment into an actual answer:
We do have some information: a Python float is an IEEE-754 double, which carries only 53 bits of significand precision, so not every decimal string can be represented exactly. For datasets like yours, the values are brushing up against the edge of that precision.
We can make that work for us - we just need to test whether the number can, in fact, be represented by a float, at each possible split point. We can abuse strings for this, by testing num_str == str(float(num_str)) (i.e. a string remains the same after being converted to a float and back to a string)
If your number is able to be represented exactly by the IEEE float standard, then the before and after will be equal
If the number cannot be represented exactly by the IEEE float standard, it will be coerced to the nearest number that a float can represent. Obviously, if we then convert this back to a string, it will not be identical to the original.
Here's a snippet, for example, that you can play around with
from typing import List

def parse_number(s: str) -> List[float]:
    if s.count('.') == 2:
        first_decimal = s.index('.')
        second_decimal = s.index('.', first_decimal + 1)
        # try split points starting with the longest possible first number
        for split_idx in range(second_decimal - 1, first_decimal + 1, -1):
            a, b = s[:split_idx], s[split_idx:]
            if str(float(a)) == a and str(float(b)) == b:
                return [float(a), float(b)]
        # default to returning as large an a as possible
        return [float(s[:second_decimal - 1]), float(s[second_decimal - 1:])]
    else:
        return [float(s)]
parse_number('33.9260060.9215201173775437')
# [33.926006, 0.9215201173775437]
# this is the only possible combination that actually works for this particular input
Obviously this isn't foolproof, and for some numbers there may not be enough information to differentiate the first number from the second. Additionally, for this to work, the tool that generated your data needs to have worked with IEEE standards-compliant floats (which does appear to be the case in this example, but may not be if the results were generated using a class like Decimal in Python, BigDecimal in Java, or something else).
Some inputs might also have multiple possibilities. In the above snippet I've biased it to take the longest possible [first number], but you could modify it to go in the opposite order and instead take the shortest possible [first number].
Yes, you have one available weapon: you're using the default precision to display the numbers. In the example you cite, there are 15 digits after the decimal point, making it easy to reconstruct the original numbers.
Let's take a simple case, where you have only 3 digits after the decimal point. It's trivial to separate
13.95857.217
The formatting requires a maximum of 2 digits before the decimal point, and three after.
Any case that has five digits between the points is trivial to split.
13.958 57.217
However, you run into the "trailing zero" problem in some cases. If you see, instead
13.9557.217
This could be either
13.950 57.217
or
13.955 07.217
Your data do not contain enough information to differentiate the two cases.
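A quick demonstration of why the trailing zero is lost in the first place: Python's default repr drops it, and only explicit formatting keeps it.
print(str(13.950))      # '13.95' -- the trailing zero is gone
print(f'{13.950:.3f}')  # '13.950' -- fixed-format output keeps it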

Numpy Vectorization - Weird issue

I am performing some vectorized calculation using numpy. I was investigating a bug I am having and I ended with this line:
(vertices[:,:,:,0]+vertices[:,:,:,1]*256)*4
The result was expected to be 100728 for the index vertices[0,0,17]; however, I am getting 35192.
When I tried to change it into 4.0 instead of 4, I ended getting the correct value of 100728 and thus fixing my bug.
I would like to understand why the floating point matters here especially that I am using python 3.7 and it is multiplication, not even division.
Extra information:
vertices.shape=(203759, 12, 32, 3)
python==3.7
numpy==1.16.1
Edit 1:
vertices type is "numpy.uint8"
vertices[0, 0, 17] => [94, 98, 63]
The issue here is that you are using too-small integers, and the number overflows and wraps around, because numpy uses fixed-width integers rather than the infinite precision of Python ints. Numpy will "promote" the type of a result based on the inputs, but it won't promote the result based on whether an overflow happens or not (promotion is decided before the actual calculation).
In this case, when you multiply vertices[:,:,:,1]*256 (I shall call this A), 256 cannot be held in a uint8, so numpy goes to the next larger type: uint16. This allows the result of the multiplication to hold the correct value here, because the maximum possible value of any element of vertices is 255, so the largest possible product is 255*256, which fits just fine in a 16-bit uint.
Then you add vertices[:,:,:,0] + A (I shall call this B). If the largest value of A is 255*256, and the largest value of vertices[:,:,:,0] is 255 (again, the largest value of a uint8), the largest possible sum of the two is 255*256 + 255 = 2^16 - 1 (the largest value you can hold in a 16-bit unsigned int). This is still fine, right up until you get to your last multiplication.
When you get to B * 4, numpy again has to decide what the return type should be. The integer 4 easily fits in a uint16, so numpy does not promote the type higher still to a uint32 or uint64, because it does not preemptively avoid overflows, as described above. This results in any products greater than 2^16 - 1 being returned modulo 2^16.
If you instead use a floating-point number (4. or 4.0), numpy sees this as a "higher" value type that cannot fit inside a uint16, so it promotes the result to floating point, which can accommodate much larger numbers without overflowing.
If you don't want to change the entire vertices array to a larger dtype, you could simply take the result B and convert it before you multiply by 4, like this: B.astype(np.uint64) * 4. This lets you hold much larger values without overflowing (though it does not eliminate the problem entirely if the values ever grow larger still).
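A small demonstration of the promotion chain described above; the array here is a stand-in for vertices[0, 0, 17] from the question, and the dtype behaviour shown is the value-based promotion of NumPy 1.x (the question uses numpy==1.16.1):
import numpy as np

v = np.array([94, 98, 63], dtype=np.uint8)

# 256 does not fit in uint8, so A (and then B) is computed as uint16
b = v[0] + v[1] * 256
print(b, b.dtype)               # 25182 uint16

# 4 fits in uint16, so the product stays uint16 and wraps mod 2^16
print(b * 4)                    # 35192 (= 100728 % 65536)

# either promote explicitly, or let a float scalar force the promotion
print(b.astype(np.uint64) * 4)  # 100728
print(b * 4.0)                  # 100728.0 (float64)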

Is there a way to convert complex number to just number?

So I have this long number (i.e.: 1081546747036327937), and when I cleaned up my data in a pandas dataframe, I didn't realize Python converted it to a complex number (i.e.: 1.081546747036328e+18).
I saved this one as csv. The problem is, I accidentally deleted the original file, tried to recover it but no success this far, so...
is there a way to convert this complex number back to their original number?
I tried to convert it to str using str(data) but it stays the same (i.e: 1.081546747036328e+18).
As was said in a comment, this is not a complex number but a floating-point number. You can certainly convert it to a (long) integer, but you cannot be sure of getting back the initial number.
In your example:
i = 1081546747036327937
f = float(i)
j = int(f)
print(i, f, j, j-i)
will display:
1081546747036327937 1.081546747036328e+18 1081546747036327936 -1
This is because floating points only have a limited accuracy and rounding errors are to be expected with large integers when the binary representation requires more than 53 bits.
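You can see where that limit kicks in: 2**53 is the first integer at which a Python float can no longer distinguish neighbours.
print(float(2**53) == float(2**53 + 1))  # True: both round to the same float
print(float(2**53 - 1) == float(2**53))  # False: still exactly representable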
As can be read here, complex numbers are the sum of a real part and an imaginary part.
3+1j is a complex number with real part 3 and imaginary part 1.
What you have is scientific notation (the type is float), which is just an ordinary float displayed as a mantissa times the specified power of 10.
1e10 equals 1 times ten to the power of ten.
To convert this to an int, you can just use int(number). For more information about Python data types, you can take a look here.
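For illustration, here is the difference between the two types, using the value from the question:
z = 3 + 1j
print(type(z), z.real, z.imag)  # <class 'complex'> 3.0 1.0

f = 1.081546747036328e+18       # the value from the question: a plain float
print(type(f))                  # <class 'float'>
print(int(f))                   # 1081546747036327936 (close to, but not
                                # exactly, the original integer)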

Pseudorandom Algorithm for VERY Large (10^1.2mil) Numbers?

I'm looking for a pseudo-random number generator (an algorithm where you input a seed number and it outputs a different 'random-looking' number, and the same seed will always generate the same output) for numbers between 1 and 95^1,312,000.
I would use the Linear Feedback Shift Register (LFSR) PRNG, but if I did, I would have to convert the seed number (which could be up to 1.2 million digits long in base-10) into a binary number, which would be so massive that I think it would take too long to compute.
In response to a similar question, the Feistel cipher was recommended, but I didn't understand the vocabulary of the wiki page for that method (I'm going into 10th grade so I don't have a degree in encryption), so if you could use layman's terms, I would strongly appreciate it.
Is there an efficient way of doing this which won't take until the end of time, or is this problem impossible?
Edit: I forgot to mention that the PRNG sequence needs to have a full period. My mistake.
A simple way to do this is to use a linear congruential generator with modulus m = 95^1312000.
The formula for the generator is x_(n+1) = a*x_n + c (mod m). By the Hull-Dobell Theorem, it will have full period if and only if gcd(m,c) = 1 and 95 divides a-1. Furthermore, if you want good second values (right after the seed) even for very small seeds, a and c should be fairly large. Also, your code can't store these values as literals (they would be much too big). Instead, you need to be able to reliably produce them on the fly. After a bit of trial and error to make sure gcd(m,c) = 1, I hit upon:
import random

def get_book(n):
    random.seed(1941)  # Borges' Library of Babel was published in 1941
    m = 95**1312000
    a = 1 + 95 * random.randint(1, m//100)
    c = random.randint(1, m - 1)  # math.gcd(c, m) == 1
    return (a*n + c) % m
For example:
>>> book = get_book(42)
>>> book % 10**100
4779746919502753142323572698478137996323206967194197332998517828771427155582287891935067701239737874
shows the last 100 digits of "book" number 42. Given Python's built-in support for large integers, the code runs surprisingly fast (it takes less than 1 second to grab a book on my machine)
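You can sanity-check the Hull-Dobell conditions on a tiny modulus built the same way (m = 95**2 here instead of 95**1312000, small enough to enumerate the whole period):
from math import gcd

m = 95**2        # 9025; same prime factors (5 and 19) as 95**1312000
a = 1 + 95 * 7   # so a ≡ 1 (mod 5) and (mod 19)
c = 7
assert gcd(c, m) == 1

x, seen = 0, set()
for _ in range(m):
    x = (a*x + c) % m
    seen.add(x)
print(len(seen) == m)  # True: the generator has full period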
If you have a method that can produce a pseudo-random digit, then you can concatenate as many digits together as you want. The result will be just as repeatable as the underlying PRNG.
However, you'll probably run out of memory scaling that up to millions of digits and attempting to do arithmetic. Work at that scale normally isn't done on "numbers"; it's done on byte vectors or something similar.
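A minimal sketch of that digit-concatenation idea (a hypothetical helper built on Python's random module, not a cryptographic PRNG):
import random

def random_digits(seed: int, n: int) -> str:
    # same seed -> same digits, as repeatable as the underlying PRNG
    rng = random.Random(seed)
    return ''.join(str(rng.randrange(10)) for _ in range(n))

assert random_digits(42, 30) == random_digits(42, 30)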

How do I represent a string as a number?

I need to represent a string as a number; however, it is 8928313 characters long. Note this string can contain more than just alphabet letters, and I have to be able to convert it back efficiently too. My current (too slow) code looks like this:
alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!#()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
alphaLeng = len(alpha)
def letterNumber(letters):
    letters = str(letters)
    cof = 1
    nr = 0
    for i in range(len(letters)):
        nr += cof * alpha.find(letters[i])
        cof *= alphaLeng
        print(i, ' ', len(letters))
    return str(nr)
Ok, since other people are giving awful answers, I'm going to step in.
You shouldn't do this.
You shouldn't do this.
An integer and an array of characters are ultimately the same thing: bytes. You can access the values in the same way.
Most number representations cap out at 8 bytes (64-bits). You're looking at 8 MB, or 1 million times the largest integer representation. You shouldn't do this. Really.
You shouldn't do this. Your number will just be a custom, gigantic number type that would be identical under the hood.
If you really want to do this, despite all the reasons above, here's how...
Code
def lshift(a, b):
    # shift byte a left by 8*b bits
    return a << (8 * b)

def string_to_int(data):
    sum_ = 0
    r = range(len(data) - 1, -1, -1)
    # bytearray needs an encoding for str input on Python 3
    for a, b in zip(bytearray(data, 'utf-8'), r):
        sum_ += lshift(a, b)
    return sum_
DON'T DO THIS
Explanation
Characters are essentially bytes: they can be encoded in different ways, but ultimately, within a given encoding, you can treat them as a sequence of bytes. In order to convert them to a number, we shift each byte left by 8 bits per position in the sequence, creating a unique number. r, the range value, is the position in reverse order: the 4th byte from the end needs to go left 24 bits (3*8), etc.
After getting the range and converting our data to 8-bit integers, we can then transform the data and take the sum, giving us our unique identifier. It will be byte-wise identical (or in reverse byte order) to the original data, but just "as a number". This is entirely futile. Don't do it.
Performance
Any performance is going to be outweighed by the fact that you're creating an identical object for no valid reason, but this solution is decently performant.
1,000 elements takes ~486 microseconds, 10,000 elements takes ~20.5 ms, and 100,000 elements takes about 1.5 seconds. It would work, but you shouldn't do it. The timings scale as O(n**2), which is likely due to the memory overhead of reallocating the data each time the integer grows larger. At that rate it would take ~4 hours to process all 8e6 elements (14,365 seconds, estimated by fitting the lower-order data to ax**2+bx+c). Remember, this is all to get the identical byte representation of the original data.
Futility
Remember, there are ~1e78 to 1e82 atoms in the entire universe, on current estimates. This is ~2^275. Your value will be able to represent 2^71426504, or about 260,000 times as many bits as you need to represent every atom in the universe. You don't need such a number. You never will.
If there are only ASCII characters, you can use ord() and chr().
built-in functions
There are several optimizations you can perform. For example, the find method requires searching through your string for the corresponding letter; a dictionary would be faster. Even faster might be (benchmark!) the chr function (if you're not too picky about the letter ordering) and the ord function to reverse the chr. But if you're not picky about ordering, it might be better to just left-NULL-pad your string and treat it as a big binary number in memory, if you don't need to display the value in any particular format.
You might get some speedup by iterating over characters instead of character indices. If you're using Python 2, a large range will be slow, since a full list needs to be generated (use xrange instead on Python 2); Python 3's range is lazy, so it's better.
Your print function is going to slow down output a fair bit, especially if you're outputting to a tty.
A big number library may also buy you a speed-up: Handling big numbers in code
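Along the lines of the "big binary number" suggestion above, a minimal sketch using Python's built-in byte conversions (note this encodes the string's UTF-8 bytes rather than the asker's custom alphabet, and a string with leading NUL characters would not round-trip):
def text_to_int(s: str) -> int:
    return int.from_bytes(s.encode('utf-8'), 'big')

def int_to_text(n: int) -> str:
    return n.to_bytes((n.bit_length() + 7) // 8, 'big').decode('utf-8')

s = 'Hello, world!'
assert int_to_text(text_to_int(s)) == s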
Your alpha.find() function needs to iterate through alpha on each loop.
You can probably speed things up by using a dict, as dictionary lookups are O(1):
alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!#()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
alpha_dict = {letter: index for index, letter in enumerate(alpha)}
print(alpha.find('$'))
# 83
print(alpha_dict['$'])
# 83
Store your strings in an array of distinct values, i.e. a string table. In your dataset, use a reference number instead; a reference number n corresponds to the nth element of the string table array.
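A minimal sketch of that string-table idea (the helper name intern is hypothetical):
table = []     # the string table: each distinct string stored once
index_of = {}  # reverse lookup: string -> reference number

def intern(s: str) -> int:
    if s not in index_of:
        index_of[s] = len(table)
        table.append(s)
    return index_of[s]

ref = intern('some very long string')
assert table[ref] == 'some very long string'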
