How do I represent a string as a number? - python

I need to represent a string as a number. However, it is 8,928,313 characters long. Note that this string can contain more than just alphabet letters, and I have to be able to convert it back efficiently too. My current (too slow) code looks like this:
alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!#()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
alphaLeng = len(alpha)
def letterNumber(letters):
    letters = str(letters)
    cof = 1
    nr = 0
    for i in range(len(letters)):
        nr += cof*alpha.find(letters[i])
        cof *= alphaLeng
        print(i,' ',len(letters))
    return str(nr)

Ok, since other people are giving awful answers, I'm going to step in.
You shouldn't do this.
You shouldn't do this.
An integer and an array of characters are ultimately the same thing: bytes. You can access the values in the same way.
Most number representations cap out at 8 bytes (64 bits). You're looking at 8 MB, or 1 million times the size of the largest native integer representation. You shouldn't do this. Really.
You shouldn't do this. Your number will just be a custom, gigantic number type that would be identical under the hood.
If you really want to do this, despite all the reasons above, here's how...
Code
def lshift(a, b):
    # shift the byte value a left by b byte positions (8 bits each)
    return a << (8 * b)

def string_to_int(data):
    sum_ = 0
    r = range(len(data) - 1, -1, -1)
    for a, b in zip(bytearray(data, 'utf-8'), r):
        sum_ += lshift(a, b)
    return sum_
DON'T DO THIS
Explanation
Characters are essentially bytes: they can be encoded in different ways, but ultimately you can treat them, within a given encoding, as a sequence of bytes. In order to convert them to a number, we shift each one left by 8 bits per position in the sequence, creating a unique number. r, the range value, is the position in reverse order: the 4th element from the end needs to go left 24 bits (3*8), etc.
After getting the range and converting our data to 8-bit integers, we can then transform the data and take the sum, giving us our unique identifier. It will be byte-for-byte identical (or in reverse byte order) to the original data, just "as a number". This is entirely futile. Don't do it.
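For what it's worth, Python can do this same byte shifting natively with int.from_bytes and int.to_bytes, so if you ignore all of the advice above, at least skip the Python-level loop. A minimal sketch:

data = 'hello world'
n = int.from_bytes(data.encode('utf-8'), 'big')   # string -> integer
# and back; note that leading NUL bytes in the original would be lost
back = n.to_bytes((n.bit_length() + 7) // 8, 'big').decode('utf-8')
assert back == data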
Performance
Any performance gain is going to be outweighed by the fact that you're creating an identical object for no valid reason, but this solution is decently performant.
1,000 elements takes ~486 microseconds, 10,000 elements takes ~20.5 ms, and 100,000 elements takes about 1.5 seconds. That means it scales as O(n**2), which is likely due to the memory overhead of reallocating the data each time the integer gets larger. Extrapolating (by fitting ax**2+bx+c to the lower-order data), processing all 8e6 elements would take roughly 4 hours (~14,365 seconds). It would work, but you shouldn't do it. Remember, this is all to get the identical byte representation as the original data.
Futility
Remember, there are ~1e78 to 1e82 atoms in the entire universe, on current estimates. That is ~2^275. Your value will be able to represent up to 2^71426504, or about 260,000 times as many bits as you need to represent every atom in the universe. You don't need such a number. You never will.

If there are only ASCII characters, you can use ord() and chr() (see the built-in functions documentation).
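A minimal sketch of that idea (my own illustration, treating the string as a base-256 number and assuming every code point is below 256):

def string_to_number(s):
    n = 0
    for ch in s:
        n = n * 256 + ord(ch)   # each character becomes one base-256 digit
    return n

def number_to_string(n):
    chars = []
    while n:
        n, code = divmod(n, 256)
        chars.append(chr(code))
    return ''.join(reversed(chars))

assert number_to_string(string_to_number('Hello, world!')) == 'Hello, world!'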

There are several optimizations you can perform. For example, the find method requires searching through your string for the corresponding letter; a dictionary lookup would be faster. Even faster might be (benchmark!) the chr function, with ord to reverse it, if you're not too picky about the letter ordering. And if you don't need to display the value in any particular format, it might be better to just left-NULL-pad your string and treat it as a big binary number in memory.
You might get some speedup by iterating over characters instead of character indices. If you're using Python 2, a large range will be slow since a list needs to be generated (use xrange instead); Python 3's range is a lazy sequence, so it doesn't have that problem.
Your print function is going to slow down output a fair bit, especially if you're outputting to a tty.
A big number library may also buy you speed-up: Handling big numbers in code
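Putting a few of those suggestions together, a hypothetical rework of the question's letterNumber (dict lookup, character iteration, no per-character print):

alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!#()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
index = {letter: i for i, letter in enumerate(alpha)}

def letter_number(letters):
    nr = 0
    cof = 1
    for ch in letters:           # iterate characters, not indices
        nr += cof * index[ch]    # O(1) dict lookup instead of alpha.find
        cof *= len(alpha)
    return nr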

Your alpha.find() function needs to iterate through alpha on each loop.
You can probably speed things up by using a dict, as dictionary lookups are O(1):
alpha = 'abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ,.?!#()+-=[]/*1234567890^*{}\'"$\\&#;|%<>:`~_'
alpha_dict = { letter: index for index, letter in enumerate(alpha)}
print(alpha.find('$'))
# 83
print(alpha_dict['$'])
# 83

Store your strings in an array of distinct values; i.e. a string table. In your dataset, use a reference number. A reference number of n corresponds to the nth element of the string table array.
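A minimal sketch of such a string table (the names are my own):

string_table = []   # the distinct strings, in first-seen order
ref_of = {}         # reverse lookup: string -> reference number

def intern(s):
    if s not in ref_of:
        ref_of[s] = len(string_table)
        string_table.append(s)
    return ref_of[s]

n = intern('some long string')                 # store n in your dataset...
assert string_table[n] == 'some long string'   # ...and resolve it later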

Related

String conversion/shortening to a fixed length similar to url-shortener

I need to shorten a unique string ID to a maximum of 12 characters.
The ID could be longer or shorter than 12 characters before the conversion but its length has to be shorter or equal to 12 after the conversion. It could also be represented by int or even float after conversion.
Using this function on the same string should always return the same shortened ID. However, it should never return the same value for two different IDs.
(I know, theoretically, this is not possible with a fixed number of output chars, but if it's reasonably unlikely to produce the same result twice, that's okay, because I am only dealing with a few thousand IDs.)
I was thinking of a hash function, but you can't really specify the length of the return value.
A benefit would be reversibility of the function, as with a URL shortener, but I can also create a dictionary for that purpose.
Any hints to an algorithm that works in this scenario are appreciated!
Let's do some maths. With 12 case-insensitive alphanumeric characters in the output, you will have 36 different output characters (26 letters + 10 digits) and 36^12 possible different outputs. If the hash function is good, the entropy in that will be log2(36^12) ≈ 62 bits.
According to the birthday paradox, though, the square root of that many possibilities already yields a 50% chance of collision; i.e. among 2^31 hashes there will very likely be one, and 50% is a lot. 2^31 is not that much, a little more than 2 billion.
With n hashes and a perfect cryptographic hash function, you will get a collision chance of roughly p:
n=1000: p=10^-13
n=10000: p=10^-11
n=100000: p=10^-9
n=1000000: p=10^-7
...
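For reference, a quick sketch of where those numbers come from, using the usual birthday-bound approximation p ≈ n^2 / (2N):

N = 2 ** 62   # ~62 bits of entropy for 12 base-36 characters
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f'n={n}: p ~ {n ** 2 / (2 * N):.0e}')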
If you take the first several characters of a known good hash like SHA2, you will mostly be good. However, note that SHA2 output in a hex-encoded form has a lot less entropy, only 4 bits per character, so 12 characters of the hex representation of a hash output will only have (slightly less than) 48 bits of entropy. Using 1000 such values will have a little less than 1.77 * 10^-9 chance for a collision, 10000 will have 1.77 * 10^-7 chance, 100000 will be 1.77 * 10^-5, 1 million will already be in the 0.1% order of magnitude and so on.
Only you can tell whether that's good enough.
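If you go the truncated-hash route, a minimal sketch (the helper name short_id is mine, not from the question):

import hashlib

def short_id(s, length=12):
    # first `length` hex characters of SHA-256; each hex character
    # carries 4 bits, so 12 characters give ~48 bits of entropy
    return hashlib.sha256(s.encode('utf-8')).hexdigest()[:length]

print(short_id('some-very-long-unique-id'))   # same input, same 12 characters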

Reconstructing two (string concatenated) numbers that were originally floats

Unfortunately, the printing instruction of a piece of code was written without an end-of-line character, so one in every 26 numbers consists of two numbers joined together. The following code shows an example of such behaviour; at the end there is a fragment of the original database.
import numpy as np
for _ in range(2):
    A = np.random.rand() + np.random.randint(0, 100)
    B = np.random.rand() + np.random.randint(0, 100)
    C = np.random.rand() + np.random.randint(0, 100)
    D = np.random.rand() + np.random.randint(0, 100)
    with open('file.txt', 'a') as f:
        f.write(f'{A},{B},{C},{D}')
And thus the output example file looks very similar to what follows:
40.63358599010553,53.86722741700399,21.800795158561158,13.95828176311762557.217562728494684,2.626308403991772,4.840593988487278,32.401778122213486
With the issue being that there are two numbers 'printed together', in the example they were as follows:
13.95828176311762557.217562728494684
So you cannot know if they should be
13.958281763117625, 57.217562728494684
or
13.9582817631176255, 7.217562728494684
Please understand that in this case there are only two options, but the problem I want to address involves 'unbounded' numbers of Python's float type (where 'unbounded' means in a range we don't know beforehand, e.g. +-1E4).
Can the original numbers be reconstructed based on "some" python internal behavior I'm missing?
Actual data with periodicity 27 (i.e. the 26th number consists of 2 joined together):
0.9221878978925224, 0.9331311610066017,0.8600582424784715,0.8754578588852764,0.8738648974725404, 0.8897837559800233,0.6773502027673041,0.736325377603136,0.7956454122424133, 0.8083168444596229,0.7089031184165164, 0.7475306242508357,0.9702361286847581, 0.9900689384633811,0.7453878225174624, 0.7749000030576826,0.7743879170108678, 0.8032590543649807,0.002434,0.003673,0.004194,0.327903,11.357262,13.782266,20.14374,31.828905,33.9260060.9215201173775437, 0.9349343132442707,0.8605282244327555,0.8741626682026793,0.8742163597524663, 0.8874673376386358,0.7109322043854609,0.7376362393985332,0.796158275345
To expand my comment into an actual answer:
We do have some information: a Python float is an IEEE-754 double-precision value, which carries only 52 bits of fraction (mantissa), so not every decimal string can be represented exactly by a float. Datasets like yours are brushing up against the edge of that precision.
We can make that work for us - we just need to test whether the number can, in fact, be represented by a float, at each possible split point. We can abuse strings for this, by testing num_str == str(float(num_str)) (i.e. a string remains the same after being converted to a float and back to a string)
If your number is able to be represented exactly by the IEEE float standard, then the before and after will be equal
If the number cannot be represented exactly by the IEEE float standard, it will be coerced to the nearest number that a float can represent. Obviously, if we then convert this back to a string, it will not be identical to the original.
Here's a snippet, for example, that you can play around with
from typing import List

def parse_number(s: str) -> List[float]:
    if s.count('.') == 2:
        first_decimal = s.index('.')
        second_decimal = s[first_decimal + 1:].index('.') + first_decimal + 1
        for split_idx in range(second_decimal - 1, first_decimal + 1, -1):
            a, b = s[:split_idx], s[split_idx:]
            # keep this split only if both halves round-trip through float
            if str(float(a)) == a and str(float(b)) == b:
                return [float(a), float(b)]
        # default to returning as large an a as possible
        return [float(s[:second_decimal - 1]), float(s[second_decimal - 1:])]
    else:
        return [float(s)]
parse_number('33.9260060.9215201173775437')
# [33.926006, 0.9215201173775437]
# this is the only possible combination that actually works for this particular input
Obviously this isn't foolproof; for some numbers there may not be enough information to differentiate the first number from the second. Additionally, for this to work, the tool that generated your data needs to have worked with IEEE standards-compliant floats (which does appear to be the case in this example, but may not be if the results were generated using a class like Python's Decimal or Java's BigDecimal or something else).
Some inputs might also have multiple possibilities. In the above snippet I've biased it to take the longest possible [first number], but you could modify it to go in the opposite order and instead take the shortest possible [first number].
Yes, you have one available weapon: you're using the default precision to display the numbers. In the example you cite, there are 15 digits after the decimal point, making it easy to reconstruct the original numbers.
Let's take a simple case, where you have only 3 digits after the decimal point. It's trivial to separate
13.95857.217
The formatting requires a maximum of 2 digits before the decimal point, and three after.
Any case that has five digits between the points is trivial to split.
13.958 57.217
However, you run into the "trailing zero" problem in some cases. If you see, instead
13.9557.217
This could be either
13.950 57.217
or
13.955 07.217
Your data do not contain enough information to differentiate the two cases.

Best way to get length of numpy unicode string dtype

I am trying to determine the maximum element length of a numpy unicode array. For example, if I have:
# (dtypes added for clarity)
a = np.array(['a'], dtype='U5')
print(get_dtype_length(a))
I'd like it to print 5.
I can do something like:
def get_dtype_length(a):
    dtype = a.dtype
    dtype_string = dtype.descr[0][1]  # == '<U5'
    length = int(dtype_string[2:])
    return length
But that seems like a roundabout way of inferring something that must be available somewhere. Is there an attribute or numpy function that I haven't found to do this directly?
Clarification based on comments:
I am specifically looking for the maximum allowable length of any element in the array, not the length of any specific element (e.g., not len(a[0]) == 1). The motivation behind this is that if I try to update a with something like a[0] = 'string_longer_than_dtype_of_a', I don't want the element to be truncated to stri.
In numpy version 1.19 I believe np.can_cast(newVal.dtype, a.dtype, casting='safe') would be a valid test for my use case (as in 1.19 safe will also test if casting results in truncation), but it still doesn't actually solve the question of testing character size.
The 5 in U5 is the length of the string for each element, not the size of the character:
The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters.
From the docs.
The size of a single Unicode character can be a constant in your program:
sizeof_numpy_unicode_char = np.dtype('U1').itemsize
You can then divide the total number of bytes per element by this constant to get buffer sizes, using either dtype.itemsize, or the shortcut ndarray.itemsize:
def get_length(a):
    return a.itemsize // sizeof_numpy_unicode_char
But the size of a character is indeed fixed: numpy stores unicode as UCS-4, i.e. 4 bytes per character.
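A quick self-contained check of the above, reusing the names from this answer:

import numpy as np

sizeof_numpy_unicode_char = np.dtype('U1').itemsize   # 4 bytes per character

def get_length(a):
    return a.itemsize // sizeof_numpy_unicode_char

a = np.array(['a'], dtype='U5')
print(get_length(a))   # 5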

Conversion of float to int in Python (Hill Cipher)

I am getting different results for conversion of float to int in my Hill Cipher code (during decryption).
Code: https://github.com/krshrimali/Hill-Cipher/blob/master/hill_cipher.py
Issue: https://github.com/krshrimali/Hill-Cipher/issues/1
Code:
# create empty plain text string
plain_text = ""

# result is a matrix [[260. 574. 439.]]
# addition of 65 because inputs are uppercase letters
for i in range(dimensions):
    plain_text += chr(int(result[0][i]) % 26 + 65)
Output: ABS
(the cipher text - encrypted text - was POH)
Result Matrix: (after multiplication of inverse with cipher key matrix)
[[ 260. 574. 539.]]
After conversion to int:
[260, 573, 538]
Can anyone explain why this happens and give a fix on this? Thanks.
The problem is that you're using int, which truncates toward zero.
Math with float values is inherently imprecise. If you don't understand why, the classic explanation is in What Every Computer Scientist Should Know About Floating-Point Arithmetic. The short version is that every conversion and every intermediate calculation gets rounded to the nearest value representable with a 52-bit fraction. That may mean that a calculation that would yield exactly 574 if performed with real numbers actually yields a number a tiny bit more or less than 574 when performed with floats. And if you end up with a number a tiny bit less than 574 and truncate it toward zero with int, you get 573.
In this case, what you want to do is use round instead, which rounds to the nearest integer. As long as you can be sure that your accumulated error is never as large as 0.5, that will do what you want. And, as long as you don't pick ridiculously huge key values (which would be pointless, because you don't get any more security that way), you can be sure of that.
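A quick illustration of the difference, with a hand-picked value standing in for accumulated float error:

x = 573.9999999999999   # what an 'exactly 574' result can look like after float math
print(int(x))           # 573 -- int() truncates toward zero
print(round(x))         # 574 -- round() picks the nearest integer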
However, there are two things worth considering here.
From a brief scan of the Hill cipher article at Wikipedia, the cipher is designed to be performed with quick pencil-and-paper operations. First, you don't need the true matrix inverse, just a matrix that is its inverse mod 26. That is easier to calculate, and it keeps you in smaller numbers that are less likely to have this problem. Second, it means you can do all the math in integers, so the problem doesn't arise in the first place: create your matrix as an array with dtype=int, and there will be no rounding issues. And, as a bonus, if you do pick ridiculously huge key values (which would be pointless, because you don't get any more security that way), you'll get an error instead of incorrect results. (If you want to allow such values, you'd want to store Python's unlimited-size int values in a dtype=object array. But if you don't need that, it just makes things slower and more complicated.)
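A minimal sketch of that integer-only approach, for a hypothetical 2x2 key whose determinant is coprime with 26 (pow(det, -1, 26) needs Python 3.8+):

import numpy as np

key = np.array([[3, 3], [2, 5]], dtype=int)

det = int(round(np.linalg.det(key))) % 26   # = 9 here
det_inv = pow(det, -1, 26)                  # modular inverse of the determinant
adj = np.array([[ key[1, 1], -key[0, 1]],
                [-key[1, 0],  key[0, 0]]], dtype=int)   # adjugate of a 2x2 matrix
key_inv = (det_inv * adj) % 26

# the product is the identity mod 26, with no floats involved
assert ((key @ key_inv) % 26 == np.eye(2, dtype=int)).all()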

Pseudorandom Algorithm for VERY Large (10^1.2mil) Numbers?

I'm looking for a pseudo-random number generator (an algorithm where you input a seed number and it outputs a different 'random-looking' number, and the same seed will always generate the same output) for numbers between 1 and 95^1,312,000.
I would use the Linear Feedback Shift Register (LFSR) PRNG, but if I did, I would have to convert the seed number (which could be up to 1.2 million digits long in base-10) into a binary number, which would be so massive that I think it would take too long to compute.
In response to a similar question, the Feistel cipher was recommended, but I didn't understand the vocabulary of the wiki page for that method (I'm going into 10th grade so I don't have a degree in encryption), so if you could use layman's terms, I would strongly appreciate it.
Is there an efficient way of doing this which won't take until the end of time, or is this problem impossible?
Edit: I forgot to mention that the prng sequence needs to have a full period. My mistake.
A simple way to do this is to use a linear congruential generator with modulus m = 95^1312000.
The formula for the generator is x_(n+1) = a*x_n + c (mod m). By the Hull-Dobell theorem, it will have full period if and only if gcd(m,c) = 1 and 95 divides a-1. Furthermore, if you want good second values (right after the seed) even for very small seeds, a and c should be fairly large. Your code can't sensibly store these values as literals (they would be millions of digits long), so instead you need to be able to reliably reproduce them on the fly. After a bit of trial and error to make sure gcd(m,c) = 1, I hit upon:
import random

def get_book(n):
    random.seed(1941)  # Borges' "The Library of Babel" was published in 1941
    m = 95**1312000
    a = 1 + 95 * random.randint(1, m//100)
    c = random.randint(1, m - 1)  # chosen (by trial and error) so math.gcd(c, m) == 1
    return (a*n + c) % m
For example:
>>> book = get_book(42)
>>> book % 10**100
4779746919502753142323572698478137996323206967194197332998517828771427155582287891935067701239737874
shows the last 100 digits of "book" number 42. Given Python's built-in support for large integers, the code runs surprisingly fast (it takes less than 1 second to grab a book on my machine)
If you have a method that can produce a pseudo-random digit, then you can concatenate as many together as you want. It will be just as repeatable as the underlying prng.
However, you'll probably run out of memory scaling that up to millions of digits and attempting to do arithmetic on them. Normally, stuff on that scale isn't done on "numbers"; it's done on byte vectors or something similar.
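A rough sketch of the concatenation idea (the function name and digit count are mine; per the caveat above, keep the digit count modest if you intend to do arithmetic on the result):

import random

def big_random_number(seed, n_digits):
    rng = random.Random(seed)                # repeatable: same seed, same digits
    digits = [str(rng.randrange(1, 10))]     # avoid a leading zero
    digits += [str(rng.randrange(10)) for _ in range(n_digits - 1)]
    return int(''.join(digits))

n = big_random_number(42, 1000)   # a repeatable 1000-digit number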
