I have a function that accepts 'data' as a parameter. Being new to python I wasn't really sure that that was even a type.
I noticed when printing something of that type it would be
b'h'
if I encoded the letter h, which doesn't make a ton of sense to me. Is there a way to define bits in Python, such as 1 or 0? I guess b'h' must be in hex? Is there a way for me to simply define an eight-bit string, like this:
bits1 = 10100000
You're conflating a number of unrelated things.
First of all, (in Python 3), quoted literals prefixed with b are of type bytes -- that means a string of raw byte values. Example:
x = b'abc'
print(type(x)) # will output `<class 'bytes'>`
This is in contrast to the str type, which is a (Unicode) string.
Integer literals can be expressed in binary using an 0b prefix, e.g.
y = 0b10100000
print(y) # Will output 160
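As a quick check of how a binary literal, its printed value, and its bit-string form relate, here is a small sketch using only built-ins. The variable names are illustrative.

```python
# An int written in binary is still just an int.
y = 0b10100000
as_text = format(y, '08b')      # int -> 8-character bit string
back = int(as_text, 2)          # bit string -> int
one_byte = bytes([y])           # a bytes object holding that single value

print(y)          # 160
print(as_text)    # 10100000
print(one_byte)   # b'\xa0'
```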
As far as I know, 'data' is not a type. Your function (probably) accepts anything you pass to it, regardless of its type.
Now, b'h' is a bytes object of length one whose single byte has the numeric value that ASCII assigns to the character 'h'. It is not hexadecimal; Python simply displays a byte as its ASCII character whenever that byte is printable.
The ASCII code for 'h' is 104 (decimal), which is 0b01101000 in binary and 0x68 in hex, so b'h' and b'\x68' are the same value.
So, here is the answer I think you are looking for: bytes literals have no escape for binary digits (\b is the backspace character, and there is no \d escape), so to build an 8-bit value from its binary representation, write the integer literal 0b01101000 and wrap it as bytes([0b01101000]). I would recommend hex instead, because it is more compact and readable. In hex, every four bits make one symbol from 0 to f, so the bit sequence 0110 1000 becomes the digits 6 and 8, written b'\x68'. A \x escape must always be followed by exactly two hex digits. The b before the quote marks tells Python to create a bytes object rather than a str.
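A short sketch, using only built-ins, that checks how the character h, the number 104, and its binary and hex spellings all name the same single byte:

```python
h = b'h'

# Indexing a bytes object yields an int, not a one-character string.
print(h[0])                      # 104
print(format(h[0], '08b'))       # 01101000
print(format(h[0], '02x'))       # 68

# Several spellings of the same one-byte value.
same = h == b'\x68' == bytes([0b01101000]) == bytes([104])
print(same)                      # True
```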
import sys
for i in range(30):
    # a = int(str(i),base = 16).to_bytes(4,sys.byteorder)
    a = i.to_bytes(4, sys.byteorder)
    print(a)
Here sys.byteorder seems to be 'little'. The output of the above code is:
b'\x00\x00\x00\x00'
b'\x01\x00\x00\x00'
b'\x02\x00\x00\x00'
b'\x03\x00\x00\x00'
b'\x04\x00\x00\x00'
b'\x05\x00\x00\x00'
b'\x06\x00\x00\x00'
b'\x07\x00\x00\x00'
b'\x08\x00\x00\x00'
b'\t\x00\x00\x00'
b'\n\x00\x00\x00'
b'\x0b\x00\x00\x00'
b'\x0c\x00\x00\x00'
b'\r\x00\x00\x00'
b'\x0e\x00\x00\x00'
b'\x0f\x00\x00\x00'
b'\x10\x00\x00\x00'
b'\x11\x00\x00\x00'
b'\x12\x00\x00\x00'
b'\x13\x00\x00\x00'
b'\x14\x00\x00\x00'
b'\x15\x00\x00\x00'
b'\x16\x00\x00\x00'
b'\x17\x00\x00\x00'
b'\x18\x00\x00\x00'
b'\x19\x00\x00\x00'
b'\x1a\x00\x00\x00'
b'\x1b\x00\x00\x00'
b'\x1c\x00\x00\x00'
b'\x1d\x00\x00\x00'
Observe that the integer 9 is written obnoxiously as b'\t\x00\x00\x00', with similar oddities for 0xa and 0xd.
Is this an aberration, or am I missing something about this notation?
My Python version is 3.8.2.
These are escape sequences.
\t represents an ASCII Horizontal Tab (TAB), \n represents an ASCII Linefeed (LF), and \r represents an ASCII Carriage Return (CR).
See Python's documentation of String and Bytes literals.
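To see which byte values get a short escape, a printable character, or a \x## form, the following sketch prints a few values side by side. bytes.hex() is shown as one way to get a uniform display; the sample values are chosen only for illustration.

```python
for value in (9, 10, 13, 11, 65, 144):
    b = bytes([value])
    # repr() is what print() shows for a bytes object; hex() always gives 2 digits per byte
    print(value, repr(b), b.hex())

# 9 b'\t' 09
# 10 b'\n' 0a
# 13 b'\r' 0d
# 11 b'\x0b' 0b
# 65 b'A' 41
# 144 b'\x90' 90
```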
I think part of the problem is that you are using bytes in two senses. It can mean a datatype and it can mean a representation. And you are expecting that a variable of datatype byte will have a particular byte representation.
Let's begin by looking at these equivalences:
>>> b"\x09\x0a\x0b\x0c\x0d\x0e" == b"\t\n\x0b\x0c\r\x0e" == bytes([9,10,11,12,13,14])
True
As you can see, even though the representations of these 6 bytes in Python code differ, the data is the same. The middle one is Python's default representation if you just call print() on a bunch of bytes.
If you only care about seeing the integer values 0 to 29 displayed as 2 hex digits, then all you need to do is format the integers as 2 hex digits, like this:
for i in range(30):
    print(f"{i:02x}")
00
01
02
03
...
1b
1c
1d
If you want a leading 0x then put it in the f-string before the opening brace.
You can't actually convert your integer value to datatype byte (which is what I think you may have been trying to do with the call to to_bytes()) because Python doesn't have a byte datatype. to_bytes() returns a bytes, which behaves at the Python level like a list of integers in the range 0–255, and its default on-screen representation is a bytestring.
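A minimal sketch of that point: the object returned by to_bytes() holds plain integers, and only its printed form uses escapes. The 'little' byte order is fixed here for reproducibility, rather than taken from sys.byteorder as in the question.

```python
a = (9).to_bytes(4, 'little')

print(a)          # b'\t\x00\x00\x00'
print(list(a))    # [9, 0, 0, 0]
print(a[0])       # 9
print(a.hex())    # 09000000
```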
Formatting only affects how the values appear on the screen. If you want the hex representation back in a variable (because you are writing a hex editor, say, and need to manipulate the appearance in your own code), then, as @Harmon758 says, use the hex() function:
for i in range(30):
    h = hex(i)
    print(h)
For values of 16 and above this gives the same output as print (f"0x{i:02x}"); for smaller values hex() omits the leading zero (hex(5) is '0x5', not '0x05'). Either way it is not doing the same thing, because h is not an integer, it is a string (of length 4 once i reaches 16). Only the screen representation is the same. If you want the string to look a bit different (a capital X, for example, or 4 leading zeroes) you can use an f-string instead of calling hex():
>>> i = 29
>>> h = f"0X{i:04x}"
>>> h
'0X001d'
>>> h = f"0X{i:04X}"
>>> h
'0X001D'
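Because these results are strings, getting a number back requires parsing. A small sketch of the round trip, using int() with base 16 (which accepts an optional 0x or 0X prefix):

```python
i = 29
h = f"0X{i:04x}"       # '0X001d', a str
n = int(h, 16)         # parse the text back into an int

print(h, type(h).__name__)   # 0X001d str
print(n, type(n).__name__)   # 29 int
print(hex(5), f"0x{5:02x}")  # 0x5 0x05
```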
I know that array.tostring gives the array of machine values. But I am trying to figure out how they are represented.
e.g
>>> a = array('l', [2])
>>> a.tostring()
'\x02\x00\x00\x00'
Here, I know that 'l' means each index will be min of 4 bytes and that's why we have 4 bytes in the tostring representation. But why is the Most significant bit populated with \x02. Shouldn't it be '\x00\x00\x00\x02'?
>>> a = array('l', [50,3])
>>> a.tostring()
'2\x00\x00\x00\x03\x00\x00\x00'
Here I am guessing the 2 in the beginning is because 50 is the ASCII value of 2, then why don't we have the corresponding char for ASCII value of 3 which is Ctrl-C
But why is the Most significant bit populated with \x02. Shouldn't it be '\x00\x00\x00\x02'?
The \x02 in '\x02\x00\x00\x00' is not the most significant byte. I guess you are confused by trying to read it as a hexadecimal number where the most significant digit is on the left. This is not how the string representation of an array returned by array.tostring() works. Bytes of the represented value are put together in the machine's native byte order, which on a little-endian machine (such as x86) runs left-to-right from least significant to most significant. Just consider the array as a list of bytes, and the first (or, rather, 0th) byte is on the left, as is usual in regular Python lists.
why don't we have the corresponding char for ASCII value of 3 which is Ctrl-C?
Do you have any example where Python represents the character behind Ctrl-C as Ctrl-C or similar? The ASCII code 3 corresponds to an unprintable character with no escape sequence of its own, so it is represented by its hex code.
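In Python 3 the same experiment can be sketched with tobytes() (tostring() was the older name). The typecode 'i' and the comparison against int.to_bytes() are choices made here for illustration; item size and byte order depend on the platform, so the sketch reads them from the array and from sys.byteorder instead of assuming them.

```python
import sys
from array import array

a = array('i', [2])
raw = a.tobytes()

# The raw bytes are the value in the machine's native byte order.
assert raw == (2).to_bytes(a.itemsize, sys.byteorder)

# Explicit byte orders for comparison.
print((2).to_bytes(4, 'little'))   # b'\x02\x00\x00\x00'
print((2).to_bytes(4, 'big'))      # b'\x00\x00\x00\x02'

# 50 is the ASCII code of '2', so it prints as a character;
# 3 is unprintable and has no short escape, so it prints as \x03.
print(bytes([50, 3]))              # b'2\x03'
```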
I'm trying to write my "personal" Python version of an STL binary file reader. According to Wikipedia, a binary STL file contains:
an 80-character (byte) header, which is generally ignored.
a 4-byte unsigned integer indicating the number of triangular facets in the file.
Each triangle is described by twelve 32-bit floating-point numbers: three for the normal and then three for the X/Y/Z coordinate of each vertex – just as with the ASCII version of STL. After these follows a 2-byte ("short") unsigned integer that is the "attribute byte count" – in the standard format, this should be zero because most software does not understand anything else. --Floating-point numbers are represented as IEEE floating-point numbers and are assumed to be little-endian--
Here is my code :
#! /usr/bin/env python3
with open("stlbinaryfile.stl","rb") as fichier :
    head=fichier.read(80)
    nbtriangles=fichier.read(4)
    print(nbtriangles)
The output is :
b'\x90\x08\x00\x00'
It represents an unsigned integer, and I need to convert it without using any package (struct, stl...). Are there any (basic) rules for doing this? I don't know what \x means. How does \x90 represent one byte?
Most of the answers on Google mention "C structs", but I don't know anything about C.
Thank you for your time.
Since you're using Python 3, you can use int.from_bytes. I'm guessing the value is stored little-endian, so you'd just do:
nbtriangles = int.from_bytes(fichier.read(4), 'little')
Change the second argument to 'big' if it's supposed to be big-endian.
Mind you, the normal way to parse a fixed width type is the struct module, but apparently you've ruled that out.
For the confusion over the repr, bytes objects will display ASCII printable characters (e.g. a) or standard ASCII escapes (e.g. \t) if the byte value corresponds to one of them. If it doesn't, it uses \x##, where ## is the hexadecimal representation of the byte value, so \x90 represents the byte with value 0x90, or 144. You need to combine the byte values at offsets to reconstruct the int, but int.from_bytes does this for you faster than any hand-rolled solution could.
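Applied to the four bytes from the question, a quick sketch of both byte orders:

```python
raw = b'\x90\x08\x00\x00'

little = int.from_bytes(raw, 'little')
big = int.from_bytes(raw, 'big')

print(little)   # 2192
print(big)      # 2416443392
```

A count of 2192 triangles looks far more plausible than the big-endian reading, which is consistent with the little-endian guess.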
Update: Since int.from_bytes apparently isn't "basic" enough, here are a couple of more complex solutions that use only top-level built-ins (no alternate constructors). For little-endian, you can do this:
def int_from_bytes(inbytes):
    res = 0
    for i, b in enumerate(inbytes):
        res |= b << (i * 8)  # Adjust each byte individually by 8 times position
    return res
You can use the same solution for big-endian by adding reversed to the loop, making it enumerate(reversed(inbytes)), or you can use this alternative solution that handles the offset adjustment a different way:
def int_from_bytes(inbytes):
    res = 0
    for b in inbytes:
        res <<= 8  # Adjust bytes seen so far to make room for new byte
        res |= b   # Mask in new byte
    return res
Again, this big-endian solution can trivially work for little-endian by looping over reversed(inbytes) instead of inbytes. In both cases inbytes[::-1] is an alternative to reversed(inbytes) (the former makes a new bytes in reversed order and iterates that, the latter iterates the existing bytes object in reverse, but unless it's a huge bytes object, enough to strain RAM if you copy it, the difference is pretty minimal).
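To check that the two loops agree with int.from_bytes, here is a sketch that gives them distinct names (hypothetical, chosen here so both can coexist) and compares the results on the bytes from the question.

```python
def int_from_bytes_little(inbytes):
    res = 0
    for i, b in enumerate(inbytes):
        res |= b << (i * 8)      # byte i contributes b * 256**i
    return res

def int_from_bytes_big(inbytes):
    res = 0
    for b in inbytes:
        res = (res << 8) | b     # shift earlier bytes up, then add the new one
    return res

raw = b'\x90\x08\x00\x00'
print(int_from_bytes_little(raw))             # 2192
print(int_from_bytes_big(reversed(raw)))      # 2192, reversing swaps the byte order
print(int_from_bytes_big(raw) == int.from_bytes(raw, 'big'))   # True
```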
The typical way to interpret an integer is to use struct.unpack, like so:
import struct
with open("stlbinaryfile.stl","rb") as fichier :
    head=fichier.read(80)
    nbtriangles=fichier.read(4)
    print(nbtriangles)
    # struct.unpack always returns a tuple, so take its only element
    nbtriangles=struct.unpack("<I", nbtriangles)[0]
    print(nbtriangles)
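For the rest of the file, the layout quoted in the question (80-byte header, 4-byte count, then twelve floats and a 2-byte attribute count per triangle) maps onto struct format strings. The sketch below is an illustration under those assumptions, not a validated STL reader: read_stl is a hypothetical helper name, and the input is a tiny file built in memory so the example runs without a real .stl file.

```python
import io
import struct

TRIANGLE = struct.Struct('<12fH')   # 3 normal + 9 vertex floats, then uint16 (50 bytes)

def read_stl(stream):
    """Return a list of (normal, vertices, attribute) tuples from a binary STL stream."""
    stream.read(80)                                   # header, ignored
    (count,) = struct.unpack('<I', stream.read(4))    # number of triangles
    triangles = []
    for _ in range(count):
        values = TRIANGLE.unpack(stream.read(TRIANGLE.size))
        normal = values[0:3]
        vertices = (values[3:6], values[6:9], values[9:12])
        triangles.append((normal, vertices, values[12]))
    return triangles

# Build a one-triangle file in memory to exercise the reader.
fake = (b'\0' * 80
        + struct.pack('<I', 1)
        + TRIANGLE.pack(0.0, 0.0, 1.0,
                        0.0, 0.0, 0.0,
                        1.0, 0.0, 0.0,
                        0.0, 1.0, 0.0,
                        0))
print(read_stl(io.BytesIO(fake)))
```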
If you are allergic to import struct, then you can also compute it by hand:
def unsigned_int(s):
    result = 0
    for ch in s[::-1]:
        result *= 256
        result += ch
    return result
...
nbtriangles = unsigned_int(nbtriangles)
As to what you are seeing when you print b'\x90\x08\x00\x00': you are printing a bytes object, which is a sequence of integers in the range 0 to 255. The first integer has the value 144 (decimal) or 90 (hexadecimal). When printing a bytes object, that value is represented by the string \x90. The second has the value eight, represented by \x08. The third and fourth integers are both zero; they are represented by \x00.
If you would like to see a more familiar representation of the integers, try:
print(list(nbtriangles))
[144, 8, 0, 0]
To compute the 32-bit integer represented by these four 8-bit integers, you can use this formula:
total = byte0 + (byte1*256) + (byte2*256*256) + (byte3*256*256*256)
Or, in hex:
total = byte0 + (byte1*0x100) + (byte2*0x10000) + (byte3*0x1000000)
Which results in:
0x00000890
Perhaps you can see the similarities to decimal, where the string "1234" represents the number:
4 + 3*10 + 2*100 + 1*1000
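Plugging the four byte values from the question into the byte formula gives a quick sanity check:

```python
byte0, byte1, byte2, byte3 = b'\x90\x08\x00\x00'   # unpacking bytes yields ints

total = byte0 + (byte1 * 0x100) + (byte2 * 0x10000) + (byte3 * 0x1000000)

print(total)                 # 2192
print(f"0x{total:08x}")      # 0x00000890
```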
After a bit of googling, nothing came up. I am manipulating sequence numbers for network packets and need the numbers to be of a fixed length. For example:
>>> 0000 + 1
1
Instead, I'd like the integer that is returned to be 0001. Are there any built-in commands for setting an integer of fixed length?
Edit: I do not need to print these integers, I need to actually manipulate them. I will need them to iterate but they must be fixed length so that they can be easily found in a networking protocol head file.
What you're asking doesn't make any sense. The integer 0011 and the integer 11 are exactly the same number.*
If you want to format them as strings to print them out or to search a text file, you can do that with, e.g., format(n, '04'). It doesn't matter whether you're formatting 11 or 0011, they're both the same number, and that number will format to the string '0011'.
If you want to convert them to big-endian 32-bit C-style unsigned integers, again, they're both the same number, and struct.pack('>I', n) will pack that number to the byte string b'\x00\x00\x00\x0b'.
If you want to add them modulo 10000, again, they're both the same number, and (n + 9990) % 10000 will give you 1.
No matter what operation you dream up, there will be no difference.
* Actually, in Python 2.x, number literals starting with 0 are treated as octal, not decimal, so 0011 is actually 9, not 11. And in 3.x numbers starting with 0 are a SyntaxError, to avoid the confusion caused by accidentally writing octal numbers. But forget all that. We're not talking about the Python number literals, we're talking about something even simpler here: the numbers themselves.
Numbers don't have a "length", they're just numbers. The representation of a number as text, in a string, has a length. To convert numbers to strings in Python, use string formatting, for example the str.format() method:
x = 1
s = "{:04d}".format(x)
print(s)
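For sequence numbers specifically, one possible arrangement is to keep the counter as an ordinary int, wrap it with a modulus, and apply the fixed width only at the point where it is turned into text or bytes. The sketch below assumes a 4-digit decimal field and a 32-bit big-endian wire field; both are illustrative assumptions, and next_seq is a hypothetical helper.

```python
import struct

WIDTH = 4
MODULUS = 10 ** WIDTH          # assumed: counter wraps after 9999

def next_seq(n):
    """Advance a sequence number, wrapping back to 0 after 9999."""
    return (n + 1) % MODULUS

seq = 0
seq = next_seq(seq)
print(format(seq, '04'))           # 0001  (fixed-width text)
print(struct.pack('>I', seq))      # b'\x00\x00\x00\x01'  (fixed-width bytes)
print(next_seq(9999))              # 0
```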
Trying to get a double-precision floating point score from a UTF-8 encoded string object in Python. The idea is to grab the first 8 bytes of the string and create a float, so that the strings, ordered by their score, would be ordered lexicographically according to their first 8 bytes (or possibly their first 63 bits, after forcing them all to be positive to avoid sign errors).
For example:
get_score(u'aaaaaaa') < get_score(u'aaaaaaab') < get_score(u'zzzzzzzz')
I have tried to compute the score in an integer using bit-shift-left and XOR, but I am not sure of how to translate that into a float value. I am also not sure if there is a better way to do this.
How should the score for a string be computed so the condition I specified before is met?
Edit: The string object is UTF-8 encoded (as per @Bakuriu's comment).
float won't give you 64 bits of precision. Use integers instead.
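The limit is easy to observe: on typical platforms a Python float is an IEEE 754 double with a 53-bit significand, so distinct 64-bit integers can collapse to the same float. A minimal check:

```python
import sys

print(sys.float_info.mant_dig)              # 53 on IEEE 754 double platforms
print(float(2**53) == float(2**53 + 1))     # True: the extra bit is lost
print(2**53 == 2**53 + 1)                   # False: ints stay exact
```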
import struct

def get_score(s):
    return struct.unpack('>Q', (u'\0\0\0\0\0\0\0\0' + s[:8])[-8:])[0]
In Python 3:
def get_score(s):
    return struct.unpack('>Q', ('\0\0\0\0\0\0\0\0' + s[:8])[-8:].encode('ascii', 'strict'))[0]
EDIT:
For floats, with 6 characters:
def get_score(s):
    return struct.unpack('>d', (u'\0\1' + (u'\0\0\0\0\0\0\0\0' + s[:6])[-6:]).encode('ascii', 'strict'))[0]
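One caveat: padding on the left makes a short string compare as if it began with NUL characters, so 'aa' would score above 'b'. If plain lexicographic order of the first 8 bytes is the goal, padding on the right may be closer to what the question asks. The sketch below is one such variant, not the answer's original method; it uses int.from_bytes instead of struct and encodes as UTF-8.

```python
def get_score(s):
    """Score a string by its first 8 UTF-8 bytes, read as a big-endian integer."""
    prefix = s.encode('utf-8')[:8]
    return int.from_bytes(prefix.ljust(8, b'\0'), 'big')   # right-pad short strings

print(get_score('aaaaaaa') < get_score('aaaaaaab') < get_score('zzzzzzzz'))   # True
print(get_score('aa') < get_score('b'))                                       # True
```

Strings that share their first 8 bytes get the same score, and the result is an unsigned value that can exceed 2**63 - 1 when the first byte is 0x80 or above.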
You will need to set up the entire alphabet and do the conversion by hand, since conversions to bases above 36 are not built in. To do that, you only need to define the complete alphabet to use. If it were an ASCII string, for instance, you would convert the input string to an integer (a long in Python 2) in base 256, using the whole ASCII table as the alphabet.
You have an example of the full functions to do it here: string to base 62 number
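A sketch of that idea, assuming a hypothetical helper to_number and an explicitly listed 62-symbol alphabet: each character is treated as one digit whose value is its position in the alphabet.

```python
import string

def to_number(s, alphabet):
    """Interpret s as a number written in base len(alphabet)."""
    base = len(alphabet)
    value = 0
    for ch in s:
        value = value * base + alphabet.index(ch)   # raises ValueError for unknown characters
    return value

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase   # 62 symbols

print(to_number('10', ALPHABET))    # 62
print(to_number('z', ALPHABET))     # 35
```

For comparisons to follow string order, inputs would first need to be padded or truncated to a common length.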
Also you don't need to worry about negative-positive numbers when doing this, since the encoding of the string with the first character in the alphabet will yield the minimum possible number in the representation, which is the negative value with the highest absolute value, in your case -2**63 which is the correct value and allows you to use < > against it.
Hope it helps!