Data encoding and decoding using Python

This is less of a programming question and more of a conceptual one. I am not a CS major, and I am trying to understand the basic difference between these 3 formats:
1) EBCDIC 2) Unsigned binary number 3) Binary coded decimal
If this is not a real question, I apologize, but Google was not very useful in explaining this to me.
Say I have a string of numbers like "12890". What would its representation be in EBCDIC, unsigned binary number and BCD format?
Is there a Python 2.6 library I can use to simply convert any string of numbers to any of these formats?
For example, for string to EBCDIC, I am doing:
def encodeEbcdic(text):
    return text.decode('latin1').encode('cp037')

print encodeEbcdic('AGNS')
But I get this: ┴╟╒Γ

EBCDIC is an IBM character encoding. It's meant for encoding text. Of course numerals can occur in text, as in "1600 Pennsylvania Avenue", so there are codes for numerals, too. To translate 1600 to EBCDIC, you need to find an EBCDIC table. Then you look up the code for 1, the code for 6, and the code for 0 (twice). According to the table at http://www.astrodigital.org/digital/ebcdic.html, the EBCDIC codes for 0 through 9 are F0 through F9, respectively. This looks familiar, but I can't say I really remember.
An unsigned binary number is just that. It's the number written in base two. (See below.)
Binary-coded decimal (BCD) is an old format for storing the decimal representation of numbers on a digital computer. Each decimal digit is represented by its binary equivalent. Let's take 64 as an example. Since 64 is just 2 to the sixth power, in binary it's represented as a 1 followed by six 0's: 1000000. In binary-coded decimal, we write the six in binary -- 0110 -- and the four in binary -- 0100 -- so that the BCD representation is 01100100. We need four bits for each digit, because the largest decimal digit, 9, works out to be 1001. BCD was used extensively in COBOL. If it's used anywhere else these days, I'm not familiar with the application.
Edit: I should have explained that F0, F1, etc. in EBCDIC are hex codes, so the F is 1111 and the digits are the same as in BCD. So, EBCDIC for numbers turns out to be the same as BCD, but with an extra 1111 before each digit.
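To make the three representations concrete, here is a rough sketch (Python 3 syntax; cp037 is one common EBCDIC code page, and the packed-BCD bytes are assembled by hand, so treat this as an illustration rather than a canonical converter):

text = "12890"

# EBCDIC (code page 037): each digit becomes one byte F0-F9
print(text.encode('cp037').hex())         # 'f1f2f8f9f0'

# Packed BCD: two decimal digits per byte, left-padded to an even digit count
digits = [int(ch) for ch in text.zfill(len(text) + len(text) % 2)]
bcd = bytes((hi << 4) | lo for hi, lo in zip(digits[::2], digits[1::2]))
print(bcd.hex())                          # '012890'

# Unsigned binary: just the number written in base two
print(bin(int(text)))                     # '0b11001001011010'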

saulspatz, thanks for your explanation. I was able to work out the methods needed to convert a string of numbers into these different encodings. I had to refer to Effective Python, Chapter 1, Item 3: Know the Differences Between bytes, str, and unicode.
And from there on, I read more about data types and such.
Anyway, to answer my questions :
1) String to EBCDIC:
def encode_ebcdic(text):
    return text.decode('latin1').encode('cp037')
The encoding here is cp037 for USA. You can use cp500 for International. Here is a list of them:
https://en.wikipedia.org/wiki/List_of_EBCDIC_code_pages_with_Latin-1_character_set
2) Hexadecimal string to unsigned binary number:
def str_to_binary(text):
    return int(text, 16)

This is pretty basic: just convert the hexadecimal string to a number.
3) Hexadecimal string to Binary coded decimal:
def str_to_bcd(text):
    return text.decode('hex')

Yes, you need to convert it to a byte string so that the BCD interpretation can take place. Please read saulspatz's answer for what BCD encoding is.
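A quick usage sketch of the three helpers above (Python 2, since they rely on the 'hex' codec; output bytes are shown as hex for readability):

print encode_ebcdic('12890').encode('hex')   # 'f1f2f8f9f0'
print str_to_binary('12890')                 # 75920 (the hex string 12890 as an int)
print str_to_bcd('012890').encode('hex')     # '012890' (the 'hex' codec needs an even-length string)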

Related

Base56 conversion etc

It seems base58 and base56 conversion treat input data as a single Big Endian number, i.e. an unsigned bigint.
If I'm encoding some integers into shorter strings by trying to use base58 or base56, it seems that in some implementations the integer is taken as a native (little endian in my case) representation of bytes and then converted to a string, while in other implementations the number is converted to big endian representation first. It seems the loose specifications of these encodings don't clarify which approach is right. Is there an explicit specification of which to do, or a more widely popular option of the two I'm not aware of?
I was trying to compare some methods of making a short URL. The source is actually a 10 digit number that's less than 4 billion. In this case I was thinking to make it an unsigned 4 byte integer, possibly Little Endian, and then encode it with a few options (with alphabets):
base64 A…Za…z0…9+/
base64 url-safe A…Za…z0…9-_
Z85 0…9a…zA…Z.-:+=^!/*?&<>()[]{}#%$#
base58 1…9A…HJ…NP…Za…km…z (excluding 0IOl+/ from base64 & reordered)
base56 2…9A…HJ…NP…Za…kmnp…z (excluding 1o from base58)
So like, base16, base32 and base64 make pretty good sense in that they're taking 4, 5 or 6 bits of input data at a time and looking them up in an alphabet index. The latter uses 4 symbols per 3 bytes. Straightforward, and this works for any data.
The other 3 have me finding various implementations that disagree with each other as to the right output. The problem appears to be that no whole number of bytes maps to a fixed number of lookups in these. E.g. taking 2^1 to 2^100 and getting the remainders modulo 56, 58 and 85 results in no remainders of 0.
Z85 (Ascii85, base85, et al.) approaches this by grabbing 4 bytes at a time and encoding them to 5 symbols, accepting some waste. But there's byte alignment to some degree here (base64 has alignment per 16 symbols, Z85 gets there with 5). But the alphabet is … not great for URLs, command-line, nor SGML/XML use.
base58 and base56 seem intent on treating the input bytes like a Big Endian ordered bigint and repeating: % base; lookup; -= % base; /= base on the input bigint. Which… I mean, I think that ends up modifying most of the input for every iteration.
For my input that's not a huge performance concern though.
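For reference, a minimal sketch of that divmod loop (assuming the common Bitcoin-style base58 alphabet, and reading the input bytes as one Big Endian bigint):

ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

def b58encode_int(data):
    # interpret the whole byte string as a single big-endian integer
    n = int.from_bytes(data, 'big')
    out = ''
    while n > 0:
        n, rem = divmod(n, 58)
        out = ALPHABET[rem] + out
    return out or ALPHABET[0]

print(b58encode_int((3003295320).to_bytes(4, 'big')))   # '5aPg4o', matching the big-endian case below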
Since we shouldn't treat the input as string data (or we get output longer than the 10 digit decimal input, and what's the point in that), does anyone know of any indication of which kind of processing results in something canonical for base56 or base58?
1) Take the Little Endian 4 byte word of the 10 digit number (< 2^32), treat that byte sequence as the different number it represents when read Big Endian, and convert that by repeating the steps.
2) Represent the 10 digit number (< 2^32) in 4 bytes Big Endian before converting that by repeating the steps.
I'm leaning towards going the route of the 2nd way.
For example given the number: 3003295320
The little endian representation is 58 a6 02 b3
The big endian representation is b3 02 a6 58, meaning:
base64 gives:
>>> base64.b64encode(int.to_bytes(3003295320,4,'little'))
b'WKYCsw=='
>>> base64.b64encode(int.to_bytes(3003295320,4,'big'))
b'swKmWA=='
>>> base64.b64encode('3003295320'.encode('ascii'))
b'MzAwMzI5NTMyMA==' # Definitely not using this
Z85 gives:
>>> encode(int.to_bytes(3003295320,4,'little'))
b'sF=ea'
>>> encode(int.to_bytes(3003295320,4,'big'))
b'VJv1a'
>>> encode('003003295320'.encode('ascii')) # padding to 4 byte boundary
b'fFCppfF+EAh8v0w' # Definitely not using this
base58 gives:
>>> base58.b58encode(int.to_bytes(3003295320,4,'little'))
b'3GRfwp'
>>> base58.b58encode(int.to_bytes(3003295320,4,'big'))
b'5aPg4o'
>>> base58.b58encode('3003295320')
b'3soMTaEYSLkS4w' # Still not using this
base56 gives:
>>> b56encode(int.to_bytes(3003295320,4,'little'))
b'4HSgyr'
>>> b56encode(int.to_bytes(3003295320,4,'big'))
b'6bQh5q'
>>> b56encode('3003295320')
b'4uqNUbFZTMmT5y' # Longer than 10 digits so...

Performing right shift and bit masking on binary fraction in python

I am looking for a way in Python to perform right shift and bit masking on a binary number which has a fraction part as well. For example, if there are 1 integer and 2 fraction bits in the number, then the number 0b101 corresponds to 1.25 in decimal. First, I want to know the Pythonic way to represent this number in Python.
Second, I want to perform 1 right shift on this number (0b101 >> 1) so that the resultant number will be 0b010, which is 0.5 in decimal. Is there an intrinsic and Pythonic way in Python to perform this operation? Similarly, how do I mask and get a specific bit from the binary number?
Presently, for the shift I am multiplying the number by 2**-x, where x is the number of right shifts. I cannot think of a similar operation I can perform for the bit mask.
If you really must get directly at the internal representation of a float you can use struct, like this:
>>> import struct
>>> a = 1.25
>>> b = struct.pack('>d',a)
>>> b
b'?\xf4\x00\x00\x00\x00\x00\x00' # the ? means \x3f, leftmost 7 bits of exponent
>>> a.hex()
'0x1.4000000000000p+0'
You can mask the bit you want out of the bytestring that struct.pack() returns.
[edit] The question mark representing \x3f is because the default output representation of a bytestring is a string and Python will where possible show an ascii character, not two hex digits.
[edit] This representation is in principle platform-dependent, but in practice it isn't, because virtually every computer (even IBM mainframes nowadays) has a floating-point processor that uses this format.
Finding out which bit you want may be something of a challenge.
>>> c = struct.pack('>d',a/2)
>>> c
b'?\xe4\x00\x00\x00\x00\x00\x00'
>>> (a/2).hex()
'0x1.4000000000000p-1'
As you can see, division by 2 is not quite the simple one-bit shift to the right that your question seems to suggest you are expecting. In this case, the division by 2 has decremented the exponent by 1 (from 0x3ff to 0x3fe; 1023 to 1022) and left the bit pattern of the fraction (0x4000) unchanged. The exponent appears large because it is biased by 1023.
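As a rough sketch of pulling the individual fields out of that representation (my own illustration, assuming the big-endian '>d' layout used above):

import struct

bits = struct.unpack('>Q', struct.pack('>d', 1.25))[0]
sign     = bits >> 63
exponent = (bits >> 52) & 0x7FF          # biased by 1023
fraction = bits & ((1 << 52) - 1)        # the implicit leading 1 is not stored
print(hex(bits), sign, exponent - 1023, hex(fraction))
# 0x3ff4000000000000 0 0 0x4000000000000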
The main difficulties are
Sign, exponent and fraction don't align to byte boundaries, but to nybble boundaries (sign plus exponent: 12 bits; fraction: 52 bits)
The number is normalized so that it has no leading zeroes (much as scientific notation in decimal is normalized so that it has no leading zeroes) and, since everyone knows it's there, the leading 1 is not stored.
I can recommend the Wikipedia article on this subject: it has lots of useful examples.
But I suspect that you don't really want to get at the internal representation of a float. Instead, you want a fixed-point binary class, without pesky binary exponents, that works much the same as you would do it on paper, and where division by a power of 2 really does reflect as a shift of so many bits to the right.
Depending on how much work you want to put into it, you could do this by defining a FixedBinary class as a subclass of numbers.Real, with the integer portion internally represented by one int and the fractional component by another int, and the sign by a third int, so that 1.25 would be represented as (1, int(0.25 * 65536), +1) (or some other power of 2).
This also shows you the simplest way to get a bit representation of your fraction.
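A minimal sketch of the fixed-point idea (a simplified, sign-less variant of the representation suggested above: the whole value held as one int scaled by 2**FRAC_BITS, so shifting and masking become ordinary integer operations):

FRAC_BITS = 2

raw = 0b101                      # 1.25 with 1 integer bit and 2 fraction bits
print(raw / 2.0**FRAC_BITS)      # 1.25

shifted = raw >> 1               # a plain integer shift: 0b010
print(shifted / 2.0**FRAC_BITS)  # 0.5

bit1 = (raw >> 1) & 1            # mask out an individual bit
print(bit1)                      # 0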
[edit] I recommend storing the sign separately. You could store it in the integer portion, or the fraction, or both, but all have disadvantages.
If you store it in the sign of the fraction, the twos-complement representation of negative integers will give you difficulty when you want to mask your bits.
If you don't store it in the sign of the fraction there will be no way to represent -0.5.
If you don't store it in the sign of the integer portion there will be no way to represent -1.0.
A multiplicand of 65536 will give you 4 decimal digits of accuracy. You can increase it if you want more. I also recommend that you store your fraction in the rightmost bits and simply ignore the leftmost bits. In other words, be content with the binary point being in the middle of the int, don't insist on it being on the left. That is because you will need headroom to the left of the binary point when you do multiplication.
Implementing your own numeric class is a considerable amount of work, though.
You can do this using fxpmath.
Info about this package is at:
https://github.com/francof2a/fxpmath
For your example:
from fxpmath import Fxp

# 4-bit word with 2 fraction bits: 0b01.01 == 1.25
x = Fxp('0b0101', signed=True, n_word=4, n_frac=2)
print(x)

# right shift by one bit
y = x >> 1
print(y)

# example of AND mask
z = x & Fxp('0b0110', signed=True, n_word=4, n_frac=2)
print(z.bin())
outputs:
1.25
0.5
0100

Python define bitwise

I have a function that accepts 'data' as a parameter. Being new to Python, I wasn't really sure that was even a type.
I noticed when printing something of that type it would be
b'h'
if I encoded the letter h. Which doesn't make a ton of sense to me. Is there a way to define bits in Python, such as 1 or 0? I guess b'h' must be in hex? Is there a way for me to simply define an eight bit string
bits1 = 10100000
You're conflating a number of unrelated things.
First of all, (in Python 3), quoted literals prefixed with b are of type bytes -- that means a string of raw byte values. Example:
x = b'abc'
print(type(x)) # will output `<class 'bytes'>`
This is in contrast to the str type, which is a (Unicode) string.
Integer literals can be expressed in binary using an 0b prefix, e.g.
y = 0b10100000
print(y) # Will output 160
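If the goal is to turn a string of bits like '10100000' into a number and back, a small sketch using only built-ins (names here are just for illustration):

bits1 = int('10100000', 2)        # parse a string of bits -> 160
print(bits1)                      # 160
print(format(bits1, '08b'))       # back to the string '10100000'
print(bits1.to_bytes(1, 'big'))   # the same value as a one-byte bytes object: b'\xa0'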
As far as I know, 'data' is not a type. Your function (probably) accepts anything you pass to it, regardless of its type.
Now, b'h' is a bytes literal containing the single byte whose value is the character code of 'h'; it is not hexadecimal notation.
The ASCII code for 'h' is 104 (decimal), which is 01101000 in binary, or 68 in hex, so the same byte can also be written as b'\x68'.
So, here is the answer I think you are looking for: there is no binary escape inside bytes literals, but you can build an 8-bit value from its binary representation with an integer literal such as 0b01101000, or with int('01101000', 2). I would recommend using hex instead, to make it more compact and readable: every four bits make one symbol from 0 to f, so the bit sequence 01101000 is 0x68, and as a single byte it is b'\x68'.

Set fixed length integer in python

After a bit of googling, nothing came up. I am manipulating sequence numbers for network packets and need the numbers to be of a fixed length. For example:
>>> 0000 + 1
1
Instead, I'd like the integer that is returned to be 0001. Are there any built-in commands for setting an integer of fixed length?
Edit: I do not need to print these integers, I need to actually manipulate them. I will need them to iterate, but they must be fixed length so that they can be easily found in a networking protocol header file.
What you're asking doesn't make any sense. The integer 0011 and the integer 11 are exactly the same number.*
If you want to format them as strings to print them out or to search a text file, you can do that with, e.g., format(n, '04'). It doesn't matter whether you're formatting 11 or 0011, they're both the same number, and that number will format to the string '0011'.
If you want to convert them to big-endian 32-bit C-style unsigned integers, again, they're both the same number, and struct.pack('>I', n) will pack that number to the byte string b'\x00\x00\x00\x0b'.
If you want to add them modulo 10000, again, they're both the same number, and (n + 9990) % 10000 will give you 1.
No matter what operation you dream up, there will be no difference.
* Actually, in Python 2.x, number literals starting with 0 are treated as octal, not decimal, so 0011 is actually 9, not 11. And in 3.x numbers starting with 0 are a SyntaxError, to avoid the confusion caused by accidentally writing octal numbers. But forget all that. We're not talking about the Python number literals, we're talking about something even simpler here: the numbers themselves.
Numbers don't have a "length", they're just numbers. The representation of a number as text, in a string, has a length. To convert numbers to strings in Python, use the format() function:
x = 1
s = "{:04d}".format(x)
print(s)
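A short sketch of keeping a running sequence number and its fixed-width text form together (the width and names here are illustrative):

WIDTH = 4

seq = 0
for _ in range(3):
    seq = (seq + 1) % 10**WIDTH          # wrap so the value always fits in WIDTH digits
    print("{:0{w}d}".format(seq, w=WIDTH))
# 0001
# 0002
# 0003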

How to compute a double precision float score from the first 8 bytes of a string in Python?

Trying to get a double-precision floating point score from a UTF-8 encoded string object in Python. The idea is to grab the first 8 bytes of the string and create a float, so that the strings, ordered by their score, would be ordered lexicographically according to their first 8 bytes (or possibly their first 63 bits, after forcing them all to be positive to avoid sign errors).
For example:
get_score(u'aaaaaaa') < get_score(u'aaaaaaab') < get_score(u'zzzzzzzz')
I have tried to compute the score in an integer using bit-shift-left and XOR, but I am not sure of how to translate that into a float value. I am also not sure if there is a better way to do this.
How should the score for a string be computed so the condition I specified before is met?
Edit: The string object is UTF-8 encoded (as per @Bakuriu's comment).
float won't give you 64 bits of precision. Use integers instead.
import struct

def get_score(s):
    return struct.unpack('>Q', (u'\0\0\0\0\0\0\0\0' + s[:8])[-8:])[0]
In Python 3:
def get_score(s):
    return struct.unpack('>Q', ('\0\0\0\0\0\0\0\0' + s[:8])[-8:].encode('ascii', 'strict'))[0]
EDIT:
For floats, with 6 characters:
def get_score(s):
    return struct.unpack('>d', (u'\0\1' + (u'\0\0\0\0\0\0\0\0' + s[:6])[-6:]).encode('ascii', 'strict'))[0]
You will need to set up the entire alphabet and do the conversion by hand, since conversions to bases greater than 36 are not built in; to do that, you only need to define the complete alphabet to use. If it were an ASCII string, for instance, you would convert the input string to a long in base 256, using the whole ASCII table as the alphabet.
You have an example of the full functions to do it here: string to base 62 number
Also, you don't need to worry about negative versus positive numbers when doing this, since a string made entirely of the first character in the alphabet will yield the minimum possible number in the representation (in your case -2**63), which is the correct value and still compares correctly with < and >.
Hope it helps!
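A rough sketch of that base-256 idea (my own illustration, not the poster's exact code): read the first 8 bytes of the encoded string as one big-endian integer, right-padding short strings with zero bytes:

def score_base256(s):
    # take at most the first 8 bytes of the UTF-8 encoding
    data = s.encode('utf-8')[:8]
    return int.from_bytes(data.ljust(8, b'\0'), 'big')

assert score_base256(u'aaaaaaa') < score_base256(u'aaaaaaab') < score_base256(u'zzzzzzzz')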
