Python store non numeric string as number - python

I am currently trying to find a way to convert any sort of text to a number, so that it can later be converted back to text.
So something like this:
text = "some string"
number = somefunction(text)
text = someotherfunction(number)
print(text) #output "some string"

If you're using Python 3, it's pretty easy. First, convert the str to bytes in a chosen encoding (utf-8 is usually appropriate), then use int.from_bytes to convert to an int:
number = int.from_bytes(mystring.encode('utf-8'), 'little')
Converting back is slightly trickier (and will lose trailing NUL bytes unless you've stored how long the resulting string should be somewhere else; if you switch to 'big' endianness, you lose leading NUL bytes instead of trailing):
recoveredstring = number.to_bytes((number.bit_length() + 7) // 8, 'little').decode('utf-8')
You can do something similar in Python 2, but it's less efficient/direct:
import binascii
number = int(binascii.hexlify(mystring.encode('utf-8')), 16)
hx = '%x' % number
hx = hx.zfill(len(hx) + (len(hx) & 1)) # Make even length hex nibbles
recoveredstring = binascii.unhexlify(hx).decode('utf-8')
That's equivalent to the 'big' endian approach in Python 3; reversing the intermediate bytes as you go in each direction would get the 'little' effect.
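As a minimal sketch of the Python 3 round trip, the two conversions can be wrapped in helper functions. The names text_to_int and int_to_text are hypothetical, and the 0x01 sentinel byte is an assumption added here to work around the NUL-byte loss mentioned above; it is not part of the original answer.

```python
def text_to_int(text: str) -> int:
    """Encode text as UTF-8 and pack the bytes into one big-endian int."""
    data = text.encode("utf-8")
    # Assumption: prepend a 0x01 sentinel so leading NUL bytes are not lost.
    return int.from_bytes(b"\x01" + data, "big")


def int_to_text(number: int) -> str:
    """Reverse text_to_int: unpack the bytes, drop the sentinel, decode."""
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return raw[1:].decode("utf-8")


number = text_to_int("some string")
print(int_to_text(number))  # some string
```

Because the sentinel is always the most significant byte, the byte length computed from bit_length() always covers the full payload.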

You can use the ASCII values to do this:
ASCII to int:
ord('a') # = 97
Back to a string:
chr(97) # = 'a' (unichr exists only in Python 2)
From there you could iterate over the string one character at a time and store these in another string. Assuming you are using standard ASCII characters, you would need to zero pad the numbers (because some are two digits and some three) like so:
s = 'My string'
number_string = ''
for c in s:
    number_string += str(ord(c)).zfill(3)
To decode this, you will read the new string three characters at a time and decode them into a new string.
This assumes a few things:
all characters can be represented by ASCII (you could use Unicode code points if not)
you are storing the numeric value as a string, not as an actual int type (not a big deal in Python—saves you from having to deal with maximum values for int on different systems)
you absolutely must have a purely numeric value, so a hexadecimal representation (which could be converted into an int) or a cryptographic algorithm won't work for you
we're not talking about GB+ of text that needs to be converted in this manner
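A minimal sketch of the encode and decode steps described in this answer, assuming ASCII-only input. The function names encode_ascii and decode_ascii are hypothetical.

```python
def encode_ascii(s: str) -> str:
    """Turn each ASCII character into a zero-padded 3-digit decimal code."""
    for c in s:
        if ord(c) > 127:
            raise ValueError("non-ASCII character: %r" % c)
    return "".join(str(ord(c)).zfill(3) for c in s)


def decode_ascii(number_string: str) -> str:
    """Read the digit string three characters at a time and rebuild the text."""
    if len(number_string) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    return "".join(
        chr(int(number_string[i:i + 3]))
        for i in range(0, len(number_string), 3)
    )


encoded = encode_ascii("My string")
print(encoded)                # 077121032115116114105110103
print(decode_ascii(encoded))  # My string
```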

Related

Why does a string converted from an array behave differently from another initialized with the same value?

The goal of the program is converting the little_endian string to another string equal to clean_data_little_endian and then to convert it using struct.unpack. However the string clean_data_little_endian behaves differently from the other, that is the result of a conversion from an array.
During debugging, clean_data_little_endian is à1ÿÏÿÊÿÄ and strBinary_Values is \xE0\x31\xFF\xCF\xFF\xCA\xFF\xC4, and if I try to print them I obtain:
clean_data_little_endian: b'\xe01\xff\xcf\xff\xca\xff\xc4' <class 'str'>
strBinary_Values: b'\\xE0\\x31\\xFF\\xCF\\xFF\\xCA\\xFF\\xC4' <class 'str'>
(strBinary values has 2 backslashes instead of one)
There must be a difference that I don't know how to remove between them, so that struct.unpack works only with clean_data_little_endian and not with strBinary_Values.
The error returned is:
unpack requires a buffer of 8 bytes
and if I change the buffer the number of bytes required becomes the double and so on.
Here's the code I used, even if I think it will not be necessary to read it.
import struct
little_endian = '#800000100?xE0??x31??xFF??xCF??xFF??xCA??xFF??xC4?'
clean_data_little_endian = '\xE0\x31\xFF\xCF\xFF\xCA\xFF\xC4'
#from raw string to clean string
listBinary_Values = []
j=0
i=0
listValuesToClean = list(little_endian[10:len(little_endian)])
for i in range(0,len(listValuesToClean)-1):
    mod = i % 5
    if ((mod == 2) or (mod == 3) or (mod == 1)):
        listBinary_Values.append(listValuesToClean[i])
        j=j+1
    if (mod == 0):
        listBinary_Values.append('\\')
        j=j+1
strBinary_Values=''.join(listBinary_Values)
print('expected: ',clean_data_little_endian.encode('raw_unicode_escape'),type(strBinary_Values), '\n' 'real: ', strBinary_Values.encode('raw_unicode_escape'),type(clean_data_little_endian))
#from clean string to initial values
iqty_of_values = len(strBinary_Values)/8
h = "H" * int(iqty_of_values)
#correct result:
ivalues = struct.unpack("<"+h,clean_data_little_endian.encode('raw_unicode_escape'))
#wrong result:
ivalues = struct.unpack("<"+h,strBinary_Values.encode('raw_unicode_escape'))
The double backslashes indicate a literal backslash, and it doesn't create the byte values you want. This would fix it. latin1 translates 1:1 Unicode string codepoints to byte values, which is required for unicode_escape to translate the literal escape codes to Unicode string codepoints, but then encoding to latin1 again turns the string back to the bytes required for unpack:
ivalues = struct.unpack("<"+h,strBinary_Values.encode('latin1').decode('unicode_escape').encode('latin1'))
print(ivalues)
# (12768, 53247, 51967, 50431)
From the looks of it, a regular expression to capture the hexadecimal bytes and a direct conversion using bytes.fromhex would be more straightforward:
import re
import struct
little_endian = '#800000100?xE0??x31??xFF??xCF??xFF??xCA??xFF??xC4?'
s = ''.join(re.findall(r'x([0-9A-F]{2})',little_endian))
print(s)
b = bytes.fromhex(s)
print(b)
data = struct.unpack(f'<{len(b)//2}H',b)
print(data)
Output:
E031FFCFFFCAFFC4
b'\xe01\xff\xcf\xff\xca\xff\xc4'
(12768, 53247, 51967, 50431)
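To see why the two strings in the question behave differently, it may help to compare their lengths directly. This is only an illustrative sketch using a shortened two-byte sample; the variable names clean, escaped and fixed are made up for the example.

```python
import struct

clean = '\xE0\x31'        # 2 characters: U+00E0 followed by '1'
escaped = '\\xE0\\x31'    # 8 characters: backslash, 'x', 'E', '0', ...

print(len(clean), len(escaped))  # 2 8

# Interpret the literal escape codes, as in the first fix above.
fixed = escaped.encode('latin1').decode('unicode_escape')
print(fixed == clean)  # True

values = struct.unpack('<H', fixed.encode('latin1'))
print(values)  # (12768,)
```

The escaped form is four times longer than the clean one, which matches the observation in the question that the buffer was always too large for the format string.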

How to compare a hex byte with its literal (visual) representation in Python?

I want to compare two entities, one being an int as a single byte and the other a str which is the ASCII code of the visual representation (visual reading) of that byte (not its ASCII value).
For example: I have the byte 0x5a, which I want to compare with a string that says '5a' (or '5A', case is not important). I don't need to compare the byte versus the 'Z' ASCII character, which in my case would be a different thing.
How can I do that?
There are functions that transform numbers into their string representation in a given base. In your case, hex should do the trick. For example:
>>> hex(0x5a)
'0x5a'
>>> hex(0x5a)[2:] # get rid of `0x` if you don't want it
'5a'
You can use hex() to turn the integer into a hex string, and then you can slice off the first two characters using string slicing to remove the leading 0x:
lhs = 90
rhs = "5a"
print(hex(lhs)[2:] == rhs)
This outputs:
True
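One caveat with slicing hex() is that it does not pad to two digits, so a byte such as 0x05 becomes '5' rather than '05'. A possible sketch that pads to two digits and ignores case is shown here; the helper name byte_matches_hex is hypothetical.

```python
def byte_matches_hex(value: int, text: str) -> bool:
    """Compare a single byte value with its two-digit hex spelling."""
    # format(..., '02x') gives lowercase hex, zero-padded to two digits.
    return format(value, '02x') == text.lower()


print(byte_matches_hex(0x5a, '5A'))  # True
print(byte_matches_hex(0x05, '05'))  # True
print(byte_matches_hex(0x5a, 'Z'))   # False
```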

Convert from ASCII to Hex in Python

I'm trying to convert a string with special characters from ASCII to hex using Python, but I don't seem to get the correct value, although it works fine for strings without special characters. Here is what I'm doing:
import binascii
s = "D`Cزف³›"
s_bytes = str.encode(s)
hex_value = str(binascii.hexlify(s_bytes),'ascii')
print (hex_value)
Output
446043d8b2d981c2b316e280ba
Where the output should be (using online converter https://www.rapidtables.com/convert/number/ascii-to-hex.html):
446043632641b3203a
str.encode(s) defaults to utf8 encoding, which doesn't give you the byte values needed to get the desired output. The values you want are simply Unicode ordinals as hexadecimal values, so get the ordinal, convert to hex and join them all together:
s = 'D`Cزف³›'
h = ''.join([f'{ord(c):x}' for c in s])
print(h)
446043632641b3203a
Just realize that Unicode ordinals can be 1-6 hexadecimal digits long, so there is no easy way to reverse the process since you have no spacing of the numbers.
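If the result has to be reversible, one option is to pad every ordinal to a fixed width. The sketch below assumes six hexadecimal digits per character (enough for any Unicode code point); its output therefore differs from the online converter's, and the function names are hypothetical.

```python
WIDTH = 6  # U+10FFFF, the largest code point, needs 6 hex digits


def to_fixed_hex(s: str) -> str:
    """Encode each character as a zero-padded 6-digit hex ordinal."""
    return ''.join(f'{ord(c):0{WIDTH}x}' for c in s)


def from_fixed_hex(h: str) -> str:
    """Split the hex string into 6-digit groups and rebuild the text."""
    return ''.join(
        chr(int(h[i:i + WIDTH], 16)) for i in range(0, len(h), WIDTH)
    )


encoded = to_fixed_hex('D`C')
print(encoded)                  # 000044000060000043
print(from_fixed_hex(encoded))  # D`C
```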

Python incorrectly converts between bytes and hex for me

I have an info_address that I want to convert to delimited hex
info_address_original = b'002dd748'
What I want is
info_address_coded = b'\x00\x2d\xd7\x48'
I tried this solution
info_address_original = b'002dd748'
info_address_intermediary = info_address_original.decode("utf-8") # '002dd748'
info_address_coded = bytes.fromhex( info_address_intermediary ) # b'\x00-\xd7H'
and I get
info_address_coded = b'\x00-\xd7H'
How would one go about correctly turning a bytes string like that into delimited hex? It worked implicitly in Python 2, but it doesn't work the way I want in Python 3.
This is only a representation of the bytes. '-' is the same as '\x2d'.
>>> b'\x00\x2d\xd7\x48' == b'\x00-\xd7H'
True
The default representation of a byte string displays the character itself for every printable ASCII byte and an escape sequence such as \xhh, where hh is the hexadecimal value of the byte, for the rest.
That means that b'\x00\x2d\xd7\x48' and b'\x00-\xd7H' are the exact same string containing 4 bytes.
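A short sketch that makes the point visible: the byte values are the same whichever way the literal is written, and a fully escaped form can be built as text if it is only needed for display. The variable names are illustrative.

```python
coded = bytes.fromhex('002dd748')

print(coded)        # b'\x00-\xd7H'
print(list(coded))  # [0, 45, 215, 72]
print(coded.hex())  # 002dd748

# Build a display string in which every byte is shown as \xhh.
# This is a str for printing, not a different bytes value.
escaped_text = ''.join(f'\\x{b:02x}' for b in coded)
print(escaped_text)  # \x00\x2d\xd7\x48
```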

How to convert byte string with non-printable chars to hexadecimal in python? [duplicate]

I have an ANSI string Ď–ór˙rXüď\ő‡íQl7 and I need to convert it to hexadecimal like this:
06cf96f30a7258fcef5cf587ed51156c37 (converted with XVI32).
The problem is that Python cannot encode all characters correctly (some of them are incorrectly displayed even here, on Stack Overflow) so I have to deal with them with a byte string.
So the above string is in bytes this: b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
And that's what I need to convert to hexadecimal.
So far I tried binascii with no success, I've tried this:
h = ""
for i in b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7':
    h += hex(i)
print(h)
It prints:
0x60xcf0x960xf30xa0x720x830xff0x720x580xfc0xef0x5c0xf50x870xed0x510x150x6c0x37
Okay. It looks like I'm getting somewhere... but what's up with the 0x thing?
When I remove 0x from the string like this:
h.replace("0x", "")
I get 6cf96f3a7283ff7258fcef5cf587ed51156c37 which looks like it's correct.
But sometimes the byte string has a 0 next to an x and it gets removed from the string, resulting in an incorrect hexadecimal string (the string above is missing the 0 at the beginning).
Any ideas?
If you're running Python 3.5+, the bytes type has a bytes.hex() method that returns the hexadecimal string representation.
>>> h = b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
>>> h.hex()
'06cf96f30a7283ff7258fcef5cf587ed51156c37'
Otherwise you can use binascii.hexlify() to do the same thing
>>> import binascii
>>> binascii.hexlify(h).decode('utf8')
'06cf96f30a7283ff7258fcef5cf587ed51156c37'
As per the documentation, hex() converts “an integer number to a lowercase hexadecimal string prefixed with ‘0x’.” So when using hex() you always get a 0x prefix. You will always have to remove that if you want to concatenate multiple hex representations.
"But sometimes the byte string has a 0 next to an x and it gets removed from the string, resulting in an incorrect hexadecimal string (the string above is missing the 0 at the beginning)."
That does not make any sense. x is not a valid hexadecimal character, so in your solution it can only be generated by the hex() call. And that, as said above, will always create a 0x. So the sequence 0x can never appear in a different way in your resulting string, so replacing 0x by nothing should work just fine.
The actual problem in your solution is that hex() does not enforce a two-digit result, as simply shown by this example:
>>> hex(10)
'0xa'
>>> hex(2)
'0x2'
So in your case, since the string starts with b'\x06', which represents the number 6, hex(6) returns only 0x6. You get a single digit here, which is the real cause of your problem.
What you can do is use format strings to perform the conversion to hexadecimal. That way you can both leave out the prefix and enforce a length of two digits. You can then use str.join to combine it all into a single hexadecimal string:
>>> value = b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'
>>> ''.join(['{:02x}'.format(x) for x in value])
'06cf96f30a7283ff7258fcef5cf587ed51156c37'
This solution works not only with a bytes string but with any iterable whose items can be formatted as hexadecimal (e.g. a list of integers):
>>> value = [1, 2, 3, 4]
>>> ''.join(['{:02x}'.format(x) for x in value])
'01020304'
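As a quick sanity check, the three approaches from the answers above can be compared against each other and reversed with bytes.fromhex. This is a small sketch with made-up variable names, not part of either answer.

```python
import binascii

value = b'\x06\xcf\x96\xf3\nr\x83\xffrX\xfc\xef\\\xf5\x87\xedQ\x15l7'

via_format = ''.join(f'{b:02x}' for b in value)       # format-string approach
via_hexlify = binascii.hexlify(value).decode('ascii')  # binascii approach
via_method = value.hex()                               # Python 3.5+ method

print(via_format == via_hexlify == via_method)  # True
print(bytes.fromhex(via_format) == value)       # True
```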
