obtain the decimal value from a hexadecimal string - python

I'm having some trouble converting a hexadecimal value to decimal. In theory it's easy with Python:
value = '000100000001f8c65400fefe3195000001230000000000000000000000000000000000000000642b00090700000000001e'
print(int(value,16))
And this is the result:
153914086775326342143664486282692693880080977757509806775956245674142536051238079779640236240803190331364310253598
Up to this point, everything is fine.
This string represents a payload of different bytes and I know two things:
1 byte = 8 bits.
Hexadecimal values are usually represented with two hexadecimal values, like FF 2E 32 etc.
The problem comes when I want to work with a specific byte, because I know that bytes 18, 19, 20 and 21 hold some decimal number that starts with 39 (I don't know the digits that follow). But when I try to decode it I can't find it.
# First try
a = value[36:43] # 18*2 to 21*2
print(a)
print(int(a,16))
# Second try
a = value[18:22] # 18 to 21
print(a)
print(int(a,16))
With the naked eye, I can see that the third and fourth digits of the first result are this 39:
153914086775326342143664486282692693880080977757509806775956245674142536051238079779640236240803190331364310253598
But then again, if I do
# third try
a = value[2:4]
print(a)
print(int(a,16))
I can't find this 39, and the values don't match the first result.
How should I do it? I'm sure it's easy, but I don't know how. I want to learn how to access the different bytes, but I can't understand the logic.
EDIT: trying to explain it better
I have this hexadecimal payload:
000100000001f8c65400fefe3195000001230000000000000000000000000000000000000000642b00090700000000001e
And this represents the set of different values collected in bytes.
What I'm trying to do is access a byte (or a group of bytes) to find out its value in decimal. For example, I know that bytes 18 to 21 are the latitude and byte 39 is the battery. How can I decode this with Python?
(In my city the latitude always starts with 39; that's what I was referring to above.)
Thank you very much

You seem to be very confused about number bases. Please look up proper terminology, as well. For instance:
Hexadecimal values are usually represented with two hexadecimal values, like FF 2E 32 etc
I'm not sure what you're trying to say here. You seem to have noticed that hex values are often separated into bytes (each hex digit is 4 bits, so a byte is two hex digits), but the way you express it makes me wonder what you mean. The byte-level separation is not a "representation"; rather, it's a reading convenience.
Your central question seems to be how to find a particular decimal subsequence coded into only a subsequence of the hexadecimal version. Have you confused character coding (such as ASCII, Unicode, or EBCDIC) with simple base representation? Character coding allows you to make this sub-string conversion; changing number bases is not at all the same operation. For instance:
base 16    base 10
1          1
21         33   = 2*16 + 1
B21        2849 = 11*16*16 + 2*16 + 1
B2         178  = 11*16 + 2
B          11
There is no subsequence in one base that is coded by a subsequence in the other base. For instance, the "84" in the base-10 value 2849 is a notational feature of the entire hexadecimal number B21, not of any subsequence of its digits.
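If what you want is simply to pull particular bytes out of the payload, here is a minimal sketch (assuming the byte positions are counted from 0 and that multi-byte fields are read big-endian; the actual numbering and byte order depend on your device's documentation):
value = '000100000001f8c65400fefe3195000001230000000000000000000000000000000000000000642b00090700000000001e'
payload = bytes.fromhex(value)           # the raw bytes of the payload
chunk = payload[18:22]                   # bytes 18, 19, 20 and 21 (0-based)
print(chunk.hex())                       # those four bytes shown as hex
print(int.from_bytes(chunk, 'big'))      # their value as one unsigned big-endian integer
# Equivalently, slicing the hex string directly: each byte is two hex characters.
print(int(value[18*2:22*2], 16))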

Related

Representing multiple values with one character in Python

I have 2 values that are in the range 0-31. I want to be able to represent both of these values in 1 character (for example in base 64 to explain what I mean by 1 character) but still be able to know what both of the values are and which came first.
Find a nice Unicode block that has 1024 contiguous codepoints, for example CJK Unified Ideographs, and map your 32*32 values onto them. In Python 3:
def char_encode(a, b):
    return chr(0x4E00 + a * 32 + b)

def char_decode(c):
    return divmod(ord(c) - 0x4E00, 32)
print(char_encode(17, 3))
# => 倣
print(char_decode('倣'))
# => (17, 3)
As you mention Base64... this is impossible. Each character in a Base64 encoding only allows for 6 bits of data, and you need 10 to represent your two numbers.
And also note that while this is only one character, it takes up two or three bytes, depending on the encoding you use. As noted by others, there is no way to stuff 10 bits of data into an 8-bit byte.
Explanation: a * 32 + b simply maps two numbers in range [0, 32) into a single number in range [0, 1024). For example, 0 * 32 + 0 = 0; 31 * 32 + 31 = 1023. chr finds the Unicode character with that codepoint, but characters with low codepoints like 0 are not printable, and would be a poor choice, so the result is shifted to the beginning of a nice big Unicode block: 0x4E00 is a hexadecimal representation of 19968, and is the codepoint of the first character in the CJK Unified Ideographs block. Using the example values, 17 * 32 + 3 = 547 and 19968 + 547 = 20515, or 0x5023 in hexadecimal, which is the codepoint of the character 倣. Thus, chr(20515) = "倣".
The char_decode function just does all of these operations in reverse: if a * p + b = x, then a, b = divmod(x, p) (see divmod). If c = chr(x), then x = ord(c) (see ord). And I am sure you know that if w + r = y, then r = y - w. So in the example, ord("倣") = 20515; 20515 - 0x4E00 = 547; and divmod(547, 32) is (17, 3).
Values [0, 31] can be stored in 5 bits, since 2**5 == 32. You can therefore unambiguously store two such values in 10 bits. Conversely, you will not be able to unambiguously retrieve two 5-bit values from fewer than 10 bits unless some other conditions hold true.
If you are using an encoding that allows 1024 or more distinct characters, you can map your pairs to characters. Otherwise you simply can't. So ASCII is not going to work here, and neither is Latin1. But pretty much any of the "normal" Unicode encodings are fine.
Keep in mind that for something like UTF-8, the actual character will take up more than 10 bits. If that's a concern, consider using UTF-16 or so.
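For completeness, the same packing can be written with bit operations, since a * 32 + b is exactly a shift by 5 bits; a small sketch (the helper names pack and unpack are just for illustration):
def pack(a, b):
    return (a << 5) | b                  # two 5-bit values -> one 10-bit value
def unpack(x):
    return x >> 5, x & 0b11111           # undo the shift and mask off the low 5 bits
c = chr(0x4E00 + pack(17, 3))
print(c, len(c.encode('utf-8')))         # one character ('倣'), but 3 bytes in UTF-8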

Base56 conversion etc

It seems base58 and base56 conversion treat input data as a single Big Endian number; an unsigned bigint number.
If I'm encoding some integers into shorter strings using base58 or base56, some implementations take the integer's native (little-endian, in my case) byte representation and convert that to a string, while others convert the number to its big-endian representation first. The loose specifications of these encodings don't seem to clarify which approach is right. Is there an explicit specification of which to use, or a more widely adopted option of the two that I'm not aware of?
I was trying to compare some methods of making a short URL. The source is actually a 10 digit number that's less than 4 billion. In this case I was thinking to make it an unsigned 4 byte integer, possibly Little Endian, and then encode it with a few options (with alphabets):
base64 A…Za…z0…9+/
base64 url-safe A…Za…z0…9-_
Z85 0…9a…zA…Z.-:+=^!/*?&<>()[]{}#%$#
base58 1…9A…HJ…NP…Za…km…z (excluding 0IOl+/ from base64 & reordered)
base56 2…9A…HJ…NP…Za…kmnp…z (excluding 1o from base58)
So like, base16, base32 and base64 make pretty good sense in that they're taking 4, 5 or 6 bits of input data at a time and looking them up in an alphabet index. The latter uses 4 symbols per 3 bytes. Straightforward, and this works for any data.
The other 3 have me finding various implementations that disagree with each other as to the right output. The problem appears to be that no whole number of bytes corresponds to a fixed number of symbols in these. E.g. taking 2^1 to 2^100 and getting the remainders for 56, 58 and 85 results in no remainders of 0.
Z85 (ascii85 and base85 et al.) approach this by grabbing 4 bytes at a time, encoding them to 5 symbols, and accepting some waste. But there's byte alignment to some degree here (base64 has alignment per 16 symbols, Z85 gets there with 5). But the alphabet is … not great for URLs, command-line, or SGML/XML use.
base58 and base56 seem intent on treating the input bytes like a Big Endian ordered bigint and repeating: % base; lookup; -= % base; /= base on the input bigint. Which… I mean, I think that ends up modifying most of the input for every iteration.
For my input that's not a huge performance concern though.
Since we shouldn't treat the input as string data (or we get output longer than the 10-digit decimal input, which defeats the point), does anyone know of any indication of which kind of processing results in something canonical for base56 or base58?
1) Take the little-endian 4-byte word of the 10-digit number (< 4*10^9), treat those bytes as a big-endian number, and convert that by repeating the steps.
2) Represent the 10-digit number (< 4*10^9) in 4 bytes big-endian before converting it by repeating the steps.
I'm leaning towards going the route of the 2nd way.
For example, given the number 3003295320:
The little-endian representation is 58 a6 02 b3.
The big-endian representation is b3 02 a6 58. Meaning:
base64 gives:
>>> base64.b64encode(int.to_bytes(3003295320,4,'little'))
b'WKYCsw=='
>>> base64.b64encode(int.to_bytes(3003295320,4,'big'))
b'swKmWA=='
>>> base64.b64encode('3003295320'.encode('ascii'))
b'MzAwMzI5NTMyMA==' # Definitely not using this
Z85 gives:
>>> encode(int.to_bytes(3003295320,4,'little'))
b'sF=ea'
>>> encode(int.to_bytes(3003295320,4,'big'))
b'VJv1a'
>>> encode('003003295320'.encode('ascii')) # padding to 4 byte boundary
b'fFCppfF+EAh8v0w' # Definitely not using this
base58 gives:
>>> base58.b58encode(int.to_bytes(3003295320,4,'little'))
b'3GRfwp'
>>> base58.b58encode(int.to_bytes(3003295320,4,'big'))
b'5aPg4o'
>>> base58.b58encode('3003295320')
b'3soMTaEYSLkS4w' # Still not using this
base56 gives:
>>> b56encode(int.to_bytes(3003295320,4,'little'))
b'4HSgyr'
>>> b56encode(int.to_bytes(3003295320,4,'big'))
b'6bQh5q'
>>> b56encode('3003295320')
b'4uqNUbFZTMmT5y' # Longer than 10 digits so...
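For reference, a minimal sketch of the repeated-divmod approach described above, treating the input bytes as one big-endian unsigned bigint (the function name is mine, and leading zero bytes, which standard base58 encodes as extra '1' characters, are ignored for simplicity):
B58_ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
def b58encode_bigendian(data):
    n = int.from_bytes(data, 'big')      # interpret the bytes as one big-endian bigint
    out = ''
    while n > 0:
        n, r = divmod(n, 58)             # repeated divmod; least significant symbol first
        out = B58_ALPHABET[r] + out
    return out or B58_ALPHABET[0]
print(b58encode_bigendian(int.to_bytes(3003295320, 4, 'big')))   # 5aPg4o, matching the example above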

Reading 12bit little endian integers from can frame

I am reading in a series of CAN BUS frames from python-can represented as hex strings, e.g. '9819961F9FFF7FC1' and I know the values in each frame are laid out as follows:
Signal  Startbit  Length
A       0         8
B       8         4
C       12        4
D       16        12
E       28        12
F       40        16
G       56        4
With each value being an unsigned integer, with little endian byte order. Where I am struggling is how to deal with the 12 bit signals, and how to do it fast as this will be running in real time. As far as I understand struct.unpack only supports 1,2,4, and 8 byte integers. The Bitstring package also only supports whole-byte bitstrings when you specify the endianness.
I clearly don't understand binary well enough to do it by manipulating the bits directly because I have been tearing my hair out trying to get sensible values...
I was able to decode the frame successfully and reasonably quickly with the bitstruct library, which can handle values with any number of bits, as in the code below.
However, I found I also had to swap the two hex characters within any byte that carries two signals, as in the CAN frame layout above. I'm still not sure why, but it does work.
import bitstruct
frame = '9819961F9FFF7FC1'          # example frame from the question
# swap the two hex characters within the bytes that carry two signals
swapped_frame = frame[0:2] + frame[3] + frame[2] + frame[4:6] + frame[7] + \
                frame[6] + frame[8:]
ba = bytearray.fromhex(swapped_frame)    # Python 3 (the Python 2 original used swapped_frame.decode('hex'))
A, B, C, D, E, F, G = bitstruct.unpack('<u8u4u4u12u12u16u4', ba)
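As an alternative sketch (hypothetical, not the approach above): for the usual Intel/little-endian CAN convention, where the start bit is the position of the signal's least significant bit within the frame read as a little-endian integer, you can shift and mask without any library. Whether this matches the bitstruct layout above depends on exactly how the start bits in your database are defined:
frame = '9819961F9FFF7FC1'
raw = int.from_bytes(bytes.fromhex(frame), byteorder='little')   # whole frame as one little-endian integer
def signal(raw, startbit, length):
    return (raw >> startbit) & ((1 << length) - 1)               # extract an unsigned field
layout = [('A', 0, 8), ('B', 8, 4), ('C', 12, 4), ('D', 16, 12),
          ('E', 28, 12), ('F', 40, 16), ('G', 56, 4)]
print({name: signal(raw, start, length) for name, start, length in layout})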

Python array.tostring - Explanation for the byte representation

I know that array.tostring gives the array of machine values, but I am trying to figure out how they are represented. For example:
>>> a = array('l', [2])
>>> a.tostring()
'\x02\x00\x00\x00'
Here, I know that 'l' means each entry will be a minimum of 4 bytes, and that's why we have 4 bytes in the tostring representation. But why is the most significant byte populated with \x02? Shouldn't it be '\x00\x00\x00\x02'?
>>> a = array('l', [50,3])
>>> a.tostring()
'2\x00\x00\x00\x03\x00\x00\x00'
Here I am guessing the 2 at the beginning is there because 50 is the ASCII value of '2'. But then why don't we have the corresponding character for ASCII value 3, which is Ctrl-C?
But why is the most significant byte populated with \x02? Shouldn't it be '\x00\x00\x00\x02'?
The \x02 in '\x02\x00\x00\x00' is not the most significant byte. I guess you are confused by trying to read it as a hexadecimal number where the most significant digit is on the left. This is not how the string representation of an array returned by array.tostring() works. Bytes of the represented value are put together in a string left-to-right in the order from least significant to most significant. Just consider the array as a list of bytes, and the first (or, rather, 0th) byte is on the left, as is usual in regular python lists.
why don't we have the corresponding char for ASCII value of 3 which is Ctrl-C?
Do you have any example where Python represents the character behind Ctrl-C as Ctrl-C or something similar? The ASCII code 3 corresponds to an unprintable character that has no dedicated escape sequence, so it is represented through its hex code.
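To make the byte order visible, here is a small illustration using struct instead of array (assuming a little-endian machine, since array uses native byte order; the outputs are shown as Python 3 bytes literals):
import struct
print(struct.pack('<l', 2))    # b'\x02\x00\x00\x00' -> least significant byte first
print(struct.pack('>l', 2))    # b'\x00\x00\x00\x02' -> the big-endian layout the question expected
print(struct.pack('<l', 50))   # b'2\x00\x00\x00'    -> 0x32 is printable, so it shows as the character '2'
print(struct.pack('<l', 3))    # b'\x03\x00\x00\x00' -> 0x03 is unprintable, so it shows as a hex escape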

Data encoding and decoding using python

This is less of a programming question and more of a question to understand what is what. I am not a CS major, and I am trying to understand the basic difference between these 3 formats:
1) EBCDIC 2) Unsigned binary number 3) Binary coded decimal
If this is not a real question, I apologize, but Google was not very useful in explaining this to me.
Say I have a string of numbers like "12890". What would their representation be in EBCDIC, unsigned binary number and BCD format?
Is there a Python 2.6 library I can use to simply convert any string of numbers to any of these formats?
For example, for string to ebcdic, I am doing
def encodeEbcdic(text):
    return text.decode('latin1').encode('cp037')
print encodeEbcdic('AGNS')
But, I get this ┴╟╒Γ
EBCDIC is an IBM character encoding. It's meant for encoding text. Of course numerals can occur in text, as in "1600 Pennsylvania Avenue" so there are codes for numerals, too. To translate 1600 to EBCDIC, you need to find an EBCDIC table. Then you look up the code for 1, the code for 6, and the code for 0 (twice.) According to the table at http://www.astrodigital.org/digital/ebcdic.html
the EBCDIC codes for 0 through 9 are F0 through F9, respectively. This looks familiar, but I can't say I really remember.
An unsigned binary number is just that. It's the number written in base two. (See below.)
Binary-coded decimal (BCD) is an old format for storing the decimal representation of numbers on a digital computer. Each decimal digit is represented by its binary equivalent. Let's take 64 as an example. Since 64 is just 2 to the sixth power, in binary it's represented as a 1 followed by 6 0's: 1000000. In binary-coded decimal, we write the six in binary -- 0110 and the four in binary -- 0100 so that the BCD representation is 01100100. We need four bits for each digit, because the largest decimal digit, 9 works out to be 1001. BCD was used extensively in COBOL. If it's used anywhere else these days, I'm not familiar with the application.
Edit: I should have explained that F0, F1, etc. in EBCDIC are hex codes, so the F is 1111 and the digits are the same as in BCD. So, EBCDIC for numbers turns out to be the same as BCD, but with an extra 1111 before each digit.
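To make that concrete, here is a Python 3 sketch of the string "12890" in the three forms (the question asks about 2.6, but the values are the same; cp037 is just one common EBCDIC code page, and the bytes.fromhex trick for BCD only works because every character is a decimal digit):
text = '12890'
print(text.encode('cp037').hex())     # EBCDIC: f1f2f8f9f0 (each digit is 0xF0 + digit)
print(bin(int(text)))                 # unsigned binary number: 0b11001001011010
packed = text if len(text) % 2 == 0 else '0' + text   # pad to a whole number of bytes
print(bytes.fromhex(packed).hex())    # packed BCD: 012890 (one decimal digit per 4-bit nibble)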
saulspatz, thanks for your explanation. I was able to work out the methods needed to convert a string of numbers into these different encodings. I had to refer to Effective Python, Chapter 1, Item 3: Know the Differences Between bytes, str, and unicode, and from there I read more about data types and such.
Anyway, to answer my questions:
1) String to EBCDIC:
def encode_ebcdic(text):
    return text.decode('latin1').encode('cp037')
The encoding here is cp037 for USA. You can use cp500 for International. Here is a list of them:
https://en.wikipedia.org/wiki/List_of_EBCDIC_code_pages_with_Latin-1_character_set
2) Hexadecimal string to unsigned binary number:
def str_to_binary(text):
    return int(text, 16)
This is pretty basic; just convert the hexadecimal string to a number.
3) Hexadecimal string to binary coded decimal:
def str_to_bcd(text):
    return bytes(text).decode('hex')
Yes, you need to convert it to a byte array so that the BCD conversion can take place. Please read saulspatz's answer for what BCD encoding is.
