I'm trying to write a single hex value, say 'F', to a file:
a = int('F', 16)
f.write(chr(a))
However, this code produces a file containing 0F. I just want the single hex digit F in my file. I know this happens because a char is represented by a byte. Is there a way to write the hex value directly, without the padding?
Use string formatting:
f.write("{:X}".format(a))
It will write it as F:
>>> "{:X}".format(a)
'F'
What you are trying to do is not possible on most modern operating systems. The smallest data unit that a general-purpose computing platform can handle is one byte.
Check this wiki article for additional details, where it states:
"Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit of memory in many computer architectures. "
You can use the struct module to write raw data to a file. This will write a single byte (0x0F) to the file:
import struct
open('file','wb').write(struct.pack('b', 0xf))
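Since a file cannot hold less than one byte, one common workaround is to pack two hex digits into each byte. A minimal Python 3 sketch follows; pack_nibbles is a hypothetical helper name, and padding an odd trailing digit with a zero low nibble is an assumed convention, not something the question specifies:

```python
def pack_nibbles(hex_digits):
    """Pack a string of hex digits into bytes, two digits per byte."""
    if len(hex_digits) % 2:
        # Assumed convention: pad an odd-length input with a trailing 0 nibble.
        hex_digits += "0"
    return bytes.fromhex(hex_digits)


packed = pack_nibbles("F")   # high nibble F, low nibble 0
print(packed)                # b'\xf0'
print(len(pack_nibbles("ABC")))  # 2 bytes for 3 digits
```

The file still grows in whole bytes; the reader has to know how many digits are meaningful.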
Related
I am experimenting with binary reading and writing to and from files in Python. I am trying to teach myself a bit of programming (it's not really teaching myself, since I use the internet, but anyway...). My problem is that reading a file in Python in binary does not actually output the bits to me, but seems to process it into text already.
Example:
My system has a file "Test.txt" in the same folder as the script.
The content of this file is the following text written in notepad:
Testing Temp "Testing"
This is a small piece of the code that is giving me some confusion:
f=open("Test.txt", "rb")
print(f.read(22))
This results in the following output:
b'Testing Temp "Testing"'
However, I want bits in the form of a string (so a string of 0's and 1's) as output. How can I do this?
What you have is a sequence of bytes (note the b at the beginning).
You can access the value of every single byte using indexing. In your example, if s=f.read(22) then s[0] will be 84 which is the ASCII code for T.
If you want to obtain the binary representation of a byte you use the bin built-in:
>>> bin(84)
'0b1010100'
It also adds the 0b prefix, which is Python's prefix for binary literals:
>>> 0b1010100
84
To obtain the bit-per-bit binary representation you can simply access every byte and call bin on each value:
def to_bits(contents):
    return ''.join(bin(byte)[2:].zfill(8) for byte in contents)
which results in:
>>> to_bits(b'Testing Temp "Testing"')
'01010100011001010111001101110100011010010110111001100111001000000101010001100101011011010111000000100000001000100101010001100101011100110111010001101001011011100110011100100010'
Note that you have to call zfill(8) because bin can return a representation shorter than 8 bits:
>>> bin(1)[2:]
'1'
>>> bin(1)[2:].zfill(8)
'00000001'
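An equivalent way to get the padding is format(byte, "08b"), which zero-fills to 8 digits and leaves out the 0b prefix. The sketch below assumes Python 3 (where iterating over bytes yields ints); bits_from_bytes and bytes_from_bits are made-up names, and the second function is only there to show the conversion is reversible:

```python
def bits_from_bytes(contents):
    # format(byte, "08b") pads to 8 binary digits and omits the "0b" prefix.
    return "".join(format(byte, "08b") for byte in contents)


def bytes_from_bits(bits):
    # Inverse direction: consume the string 8 characters at a time.
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


bits = bits_from_bytes(b"T")
print(bits)                   # 01010100
print(bytes_from_bits(bits))  # b'T'
```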
I have large hex data files from which I need to compare some hex values. When I read them in Python, the contents seem to be converted to ASCII automatically, so I have to decode them again. How can I read a file directly as hex?
So far I have tried the IntelHex Python package, but it throws an error:
intelhex.HexRecordError: Hex files contain invalid record.
Is there a problem with my files?
How much of a performance difference will it make if I manage to read the hex data without decoding?
Split the file into hex words consisting purely of [0-9a-fA-F] characters; int(word, 16) then converts each word to a normal Python integer, and you can compare the integers directly.
Alternatively you can keep the hex words and then convert an integer to a hex string using '{0:x}'.format(someinteger), prior to comparing the hex strings.
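A minimal sketch of the integer approach, assuming the hex words are separated by whitespace (the question does not say how the files are laid out); hex_words_to_ints is a hypothetical helper name:

```python
def hex_words_to_ints(text):
    # split() breaks on any whitespace; int(word, 16) raises ValueError
    # if a word is not valid hexadecimal.
    return [int(word, 16) for word in text.split()]


a = hex_words_to_ints("00FF 1a2b\n7f")
b = hex_words_to_ints("ff 1A2B 7F")
print(a)       # [255, 6699, 127]
print(a == b)  # True: case and leading zeros no longer matter
```

Comparing integers sidesteps differences in case and zero padding that would make a plain string comparison fail.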
>>> s = open('input_file', 'rb').read(10)
>>> s
'\x00\x00\x00\x02\x00\xe6\x00\xa1I\x8d'
It is an ordinary sequence of bytes. If a byte is in the ASCII range, it is shown as the corresponding character in the representation, e.g., s[-2] == 'I'. The byte is the same (73 in decimal form); it is just shown in a human-readable form.
You don't need to do any conversion to compare bytestrings (a[2:10] == b[4:12] works). Python does not decode your files to hex, ASCII, or anything else unless you ask. Just make sure you open the files in binary mode (rb).
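The same idea in Python 3, as a small sketch: the two values below are invented in-memory stand-ins for what open(..., 'rb').read() would return, and bytes.hex() is used only for display:

```python
# Stand-ins for the contents of two files read in binary mode.
a = bytes([0x00, 0x00, 0x00, 0x02, 0x00, 0xE6, 0x00, 0xA1, 0x49, 0x8D])
b = b"\xff\xff" + a

print(a[2:10] == b[4:12])  # True: slices compare byte for byte
print(a.hex())             # 0000000200e600a1498d
print(a[-2])               # 73: in Python 3, indexing bytes gives an int
```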
In Python 3, when I opened a text file with mode string 'rb', and then did f.read(), I was taken aback to find the file contents enclosed in single quotes after the character 'b'.
In Python 2 I just get the file contents.
I'm sure this is well known, but I can't find anything about it in the doco. Could someone point me to it?
You get "just the file contents" in Python 3 as well. Most likely you can just keep on doing whatever you were doing anyway. Read on for a longer explanation:
The b'' signifies that the result value is a bytes string. A bytes string is quite similar to a normal string, but not identical, and is used to handle binary, non-textual data.
Some string methods that don't make sense for binary data are gone, but most are still there. A big difference is that when you get a specific byte from a bytes string you get an integer back, while for a normal str you get a one-length str.
>>> b'foo'[1]
111
>>> 'foo'[1]
'o'
If you open the file in text mode with the 't' flag you get a str back. The Python 3 str is what in Python 2 was called unicode. It's used to handle textual data.
You convert back and forth between bytes and str with the .encode() and .decode() methods.
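A short Python 3 round trip, as an illustration; UTF-8 is an assumed encoding here, and in practice you need to know which encoding the file actually uses:

```python
text = "na\u00efve"                # five characters, one of them non-ASCII
data = text.encode("utf-8")        # str -> bytes
round_trip = data.decode("utf-8")  # bytes -> str

print(len(text), len(data))  # 5 6: the non-ASCII character takes two bytes
print(data)                  # b'na\xc3\xafve'
print(round_trip == text)    # True
```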
First of all, the Python 2 str type has been renamed to bytes in Python 3, and byte literals use the b'' prefix. The Python 2 unicode type is the new Python 3 str type.
To get the Python 3 file behaviour in Python 2, you'd use io.open() or codecs.open(); Python 3 decodes text files to Unicode by default.
What you see is that for binary files, Python 3 gives you the exact same thing as in Python 2, namely byte strings. What changed, then, is that the repr() of a byte string is prefixed with b, and that print() shows this repr() for a bytes value: print() calls str() on its arguments, and str() of a bytes object is the same as its repr().
To print your binary data as Unicode text with the print() function, decode it to Unicode first. But then you could perhaps have opened the file as a text file instead anyway.
The bytes type has some other improvements to reflect that you are dealing with binary data, not text. Indexing individual bytes or iterating over a bytes value gives you int values (between 0 and 255) and not characters, for example.
Sometimes we need (needed?) to know whether a text file had single-character newlines (0A) or double character newlines (0D0A).
We used to avoid confusion by opening the text file in binary mode, recognising 0D and 0A, and treating other bytes as regular text characters.
One could port such code by finding all binary-mode reads and replacing them with a new function oldread() that stripped off the added material, but it’s a bit painful.
I suppose the Python theologians thought of keeping ‘rb’ as it was, and adding a new ‘rx’ or something for the new behaviour. It seems a bit high-handed just to abolish something.
But, there it is, the question is certainly answered by a search for ‘rb’ in Lennert’s document.
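For what it's worth, the old binary-mode approach to newline detection should carry over to Python 3, since 'rb' still returns the bytes unchanged; only the literals need a b prefix. A rough sketch, where newline_style is a made-up helper name and the four labels are arbitrary:

```python
def newline_style(data):
    """Classify the newlines in a bytes object read in binary mode."""
    crlf = data.count(b"\r\n")       # 0D 0A pairs
    lf = data.count(b"\n") - crlf    # bare 0A not preceded by 0D
    if crlf and lf:
        return "mixed"
    if crlf:
        return "CRLF"
    if lf:
        return "LF"
    return "none"


print(newline_style(b"one\r\ntwo\r\n"))  # CRLF
print(newline_style(b"one\ntwo\n"))      # LF
```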
I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python (in 2.x, anyway) str type is really a container for bytes, not characters. So it isn't really text in the first place :) It displays the bytes assuming a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It will be formatted differently again if you print the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
I am opening up a binary file like so:
file = open("test/test.x", 'rb')
and reading in lines to a list. Each line looks a little like:
'\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
I am having a hard time manipulating this data. If I try and print each line, python freezes, and emits beeping noises (I think there's a binary beep code in there somewhere). How do I go about using this data safely? How can I convert each hex number to decimal?
To print it, you can do something like this:
print repr(data)
For the whole thing as hex:
print data.encode('hex')
For the decimal value of each byte:
print ' '.join([str(ord(a)) for a in data])
To unpack binary integers, etc. from the data as if they originally came from a C-style struct, look at the struct module.
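The snippets above are Python 2. In Python 3 the same three views might look like the sketch below, which uses the first eight bytes of the line from the question; bytes.hex() stands in for encode('hex'), and ord() is not needed because iterating over bytes already yields integers:

```python
data = b"\xbe\x00\xc8d\xf8d\x08\xe4"

as_repr = repr(data)                         # escaped form, safe to print
as_hex = data.hex()                          # replaces data.encode('hex')
as_decimal = " ".join(str(b) for b in data)  # one decimal value per byte

print(as_repr)
print(as_hex)      # be00c864f86408e4
print(as_decimal)  # 190 0 200 100 248 100 8 228
```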
\xhh is the character with hex value hh. Other characters, such as . and ~, are normal characters.
Iterating on a string gives you the characters in it, one at a time.
ord(c) will return an integer representing the character. E.g., ord('A') == 65.
This will print the decimal numbers for each character:
s = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
print ' '.join(str(ord(c)) for c in s)
Binary data is rarely divided into "lines" separated by '\n'. If it is, it will have an implicit or explicit escape mechanism to distinguish between '\n' as a line terminator and '\n' as part of the data. Reading such a file as lines blindly without knowledge of the escape mechanism is pointless.
To answer your specific concerns:
'\x07' is the ASCII BEL character, which was originally for ringing the bell on a teletype machine.
You can get the integer value of a byte 'b' by doing ord(b).
HOWEVER, to process binary data properly, you need to know what the layout is. You can have signed and unsigned integers (of sizes 1, 2, 4, 8 bytes), floating point numbers, decimal numbers of varying lengths, fixed-length strings, variable-length strings, and so on. An added complication is whether the data is recorded in big-endian or little-endian fashion. Once you know all of the above (or have very good informed guesses), the Python struct module should be able to handle all or most of your processing; the ctypes module may also be useful.
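To illustrate why the layout matters, here is a small sketch with the struct module. The byte values are invented for the example, not taken from the asker's file; the point is only that the same bytes give different numbers depending on the byte order and signedness you assume:

```python
import struct

raw = b"\x00\x00\x00\x02"

big = struct.unpack(">I", raw)[0]     # big-endian unsigned 32-bit
little = struct.unpack("<I", raw)[0]  # little-endian unsigned 32-bit
print(big, little)                    # 2 33554432

pair = b"\xff\xfe"
signed = struct.unpack(">h", pair)[0]    # signed 16-bit
unsigned = struct.unpack(">H", pair)[0]  # unsigned 16-bit
print(signed, unsigned)                  # -2 65534
```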
Does the data format have a name? If so, tell us; we may be able to point you to code or docs.
You ask "How do I go about using this data safely?" which begs the question: What do you want to use it for? What manipulations do you want to do?
Like theatrus mentioned, ord and hex might help you.
If you want to try to interpret some sort of structured binary data in the file, the struct module might be helpful.
You are trying to print the data converted to ASCII characters, which will not work.
You can safely use any byte of the data. If you want to print it as hexadecimal, look at the functions ord and hex.
Are you using read() or readline()? You should be using read(n) to read n bytes; readline() will read until it hits a newline, which the binary file might not have.
In either case, though, you are returned a string of bytes, which may be printable or non-printable characters, and is probably not very useful.
What you want is ord(), which converts a one-byte string into the corresponding integer value. read() from the file one byte at a time and call ord() on the result, or iterate through the entire string.
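A sketch of the read(n) loop in Python 3, where ord() is no longer needed because a bytes object already iterates as integers. io.BytesIO stands in for a file opened with 'rb', and read_values and its chunk_size default are made up for the example:

```python
import io


def read_values(stream, chunk_size=4):
    """Read a binary stream in fixed-size chunks and collect byte values."""
    values = []
    while True:
        chunk = stream.read(chunk_size)  # returns b"" at end of stream
        if not chunk:
            break
        values.extend(chunk)             # bytes iterate as ints in Python 3
    return values


stream = io.BytesIO(b"\xbe\x00\xc8d\xf8d\x08\xe4.\x07")
values = read_values(stream)
print(values)  # [190, 0, 200, 100, 248, 100, 8, 228, 46, 7]
```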
If you are willing to use NumPy and bitstream, you can do
>>> from numpy import *
>>> from bitstream import BitStream
>>> raw = '\xbe\x00\xc8d\xf8d\x08\xe4.\x07~\x03\x9e\x07\xbe\x03\xde\x07\xfe\n'
>>> stream = BitStream(raw)
>>> stream.read(raw, uint8, len(stream) // 8)
array([190, 0, 200, 100, 248, 100, 8, 228, 46, 7, 126, 3, 158,
7, 190, 3, 222, 7, 254, 10], dtype=uint8)