Is there a simple way to, in Python, read a file's hexadecimal data into a list, say hex?
So hex would be this:
hex = ['AA','CD','FF','0F']
I don't want to have to read into a string, then split. This is memory intensive for large files.
s = "Hello"
hex_list = ["{:02x}".format(ord(c)) for c in s]
Output
['48', '65', '6c', '6c', '6f']
Just change s to open(filename).read() and you should be good.
with open('/path/to/some/file', 'r') as fp:
    hex_list = ["{:02x}".format(ord(c)) for c in fp.read()]
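Or, if you do not want to keep the whole list in memory at once for large files, use a generator expression instead (note that fp.read() still loads the file contents themselves; only the list of hex strings is avoided):
hex_list = ("{:02x}".format(ord(c)) for c in fp.read())
To get the values one at a time, keep calling
next(hex_list)
To get all the remaining values from the generator, call
list(hex_list)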
Using Python 3, let's assume the input file contains the sample bytes you show. For example, we can create it like this
>>> inp = bytes((170,12*16+13,255,15)) # i.e. b'\xaa\xcd\xff\x0f'
>>> with open(filename,'wb') as f:
...     f.write(inp)
Now, given we want the hex representation of each byte in the input file, it would be nice to open the file in binary mode, without trying to interpret its contents as characters/strings (or we might trip on the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 0: invalid start byte)
>>> with open(filename,'rb') as f:
...     buff = f.read() # it reads the whole file into memory
...
>>> buff
b'\xaa\xcd\xff\x0f'
>>> out_hex = ['{:02X}'.format(b) for b in buff]
>>> out_hex
['AA', 'CD', 'FF', '0F']
If the file is large, we might want to read one byte at a time or in chunks. For that purpose, I recommend reading this Q&A.
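As a rough sketch of the chunked approach (Python 3; the function name iter_hex and the chunk sizes are illustrative choices, not taken from the linked Q&A), a generator can read one fixed-size block at a time and yield one hex string per byte, so only a single chunk is held in memory:

```python
import os
import tempfile


def iter_hex(path, chunk_size=4096):
    """Yield the two-digit uppercase hex string of each byte in the file."""
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read(chunk_size) until it returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            for b in chunk:  # iterating over bytes yields ints in Python 3
                yield '{:02X}'.format(b)


# Small demonstration with a throwaway file containing the sample bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\xaa\xcd\xff\x0f')
    demo_path = tmp.name

demo_hex = list(iter_hex(demo_path, chunk_size=2))
os.remove(demo_path)
print(demo_hex)
```

Calling list() on the generator is only done here to show the result; for a genuinely large file you would loop over iter_hex(path) and process the values as they arrive.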
Be aware that for viewing hexadecimal dumps of files, there are utilities available on most operating systems. If all you want to do is hex dump the file, consider one of these programs:
od (octal dump, which has a -x or -t x option)
hexdump
xd (a utility available under Windows)
Online hex dump tools, such as this one.
I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
The thing is, I want to read this data four bytes at a time (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n, which is then used as the first byte of the next int I read. I don't want to simply use .splitlines() because some data could contain a \n byte and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the newline/end-of-line characters; why is that not the case for me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)
(Followed from discussion with OP in chat)
Seems like the file is in binary format and the newlines are just mis-interpreted values. This can happen when writing 10 to the file for example.
This doesn't mean that newline was intended, and it is probably not. You can just ignore it being printed as \n and just use it as data.
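A small sketch (Python 3) of why the "newline" shows up: packing the integer 10 as a 32-bit value produces the byte 0x0A, which prints as \n. The sample values, the use of io.BytesIO in place of a real file, and the explicit little-endian format '<i' are assumptions made for the demonstration.

```python
import io
import struct

# Three little-endian 32-bit integers; the middle one is 10, whose low byte is 0x0A.
raw = struct.pack('<3i', 9, 10, 11)
stream = io.BytesIO(raw)  # stands in for a file opened with 'rb'

values = []
while True:
    chunk = stream.read(4)
    if len(chunk) < 4:  # end of data (or a truncated record)
        break
    values.append(struct.unpack('<i', chunk)[0])

print(repr(raw[4:8]))  # the bytes of the integer 10 start with a newline byte
print(values)
```

Reading fixed-size records like this never needs readline() or splitlines(), so a 0x0A byte is treated as ordinary data.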
You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
...     d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
...     diff = 4 - len(d)
>>>
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")
I want to open up about 135 different offsets in the file in hex form. The sections of interest are the names of the character skins in the game, so an easy way to edit these and save them would save me MEGA time.
This is the code I ended up with, something I could understand. I converted the file to HEX and TEXT form:
import binascii
filename = 'Skin1.pack'
with open(filename, 'rb') as f:
    content = f.read()
out = binascii.hexlify(content)
f = open('hex.txt', 'wb')
f.write(out)
f.close()
import binascii
filename = 'hex.txt'
with open(filename, 'rb') as f:
    content = f.read()
asci = binascii.unhexlify(content)
w = open('printed-hex.txt', 'wb')
w.write(asci)
w.close()
Now I'm trying to replace some of the text in the file:
f = open("printed-hex.txt",'r')
filedata = f.read()
f.close()
newdata = filedata.replace("K n i g h t ",input)
f = open("printed-hex.txt",'w')
f.write(newdata)
f.close()
but I'm met with this error,
Traceback (most recent call last):
  File "C:\Users\Dee\Desktop\ARC to HEX\Edit-Printed-HEX.py", line 3, in <module>
    filedata = f.read()
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2656: character maps to <undefined>
To nitpick, hex doesn't have 'lines' so you might want to think about how you will limit the location you want to edit. Perhaps edit a fixed number of bytes.
The output you have seen in the console is Python attempting to print binary data. It has printed the extended characters because there aren't printable characters that correspond to those bytes in the string. You can see that some characters are printable, and that is why you have things like 7(5. in it.
What you need is an easy way to represent the binary data as hex, and a way to convert back. I'll leave the implementation of the actual editor up to you.
import mmap
# Open in binary mode; the data is mapped read-only (Python 2, Unix).
handle = open('/usr/bin/xxd', 'rb')
memorymap = mmap.mmap(handle.fileno(), 0, prot=mmap.PROT_READ)
# Lookup tables between 4-bit values (0..15) and hex digits.
value_to_hex = dict(enumerate('0123456789ABCDEF'))
hex_to_value = {v: k for (k, v) in value_to_hex.items()}
def expand_byte(byte):
    """ Converts a single byte into 2 4 bit values """
    return [(byte >> s) & 0xF for s in [4, 0]]
def compact_bytes(values):
    """ Converts 2 4 bit values into a single byte """
    return (values[0] << 4) | values[1]
def bin_to_hex(data):
    """ Converts binary data to hex characters """
    return [value_to_hex[v] for b in data for v in expand_byte(b)]
def hex_to_bin(hexadecimal):
    """ Converts hex characters to binary data """
    return [
        compact_bytes([hex_to_value[v] for v in hexadecimal[i:i + 2]])
        for i in range(0, len(hexadecimal), 2)
    ]
test_data = [ord(c) for c in memorymap[0:8]]
hex_data = bin_to_hex(test_data)
final_data = hex_to_bin(hex_data)
print "From '{0}'\nto '{1}'\nto '{2}'".format([chr(c) for c in test_data], hex_data, [chr(c) for c in final_data])
This prints:
From '['\x7f', 'E', 'L', 'F', '\x02', '\x01', '\x01', '\x00']'
to '['7', 'F', '4', '5', '4', 'C', '4', '6', '0', '2', '0', '1', '0', '1', '0', '0']'
to '['\x7f', 'E', 'L', 'F', '\x02', '\x01', '\x01', '\x00']'
Bitwise value manipulation is something you may not have come across before, so you should learn about it. The >> << | and & operators are bitwise operators.
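As a quick illustration of those four operators, here is a minimal sketch that splits one byte into its two 4-bit halves and joins them again. The variable names are arbitrary; 0x7F is the first byte of the output shown above.

```python
byte = 0x7F  # first byte of the ELF header printed above

high = (byte >> 4) & 0xF     # shift the upper four bits down, then mask them
low = byte & 0xF             # the mask keeps only the lower four bits
rebuilt = (high << 4) | low  # shift the high half back up and merge with OR

print(high, low, hex(rebuilt))
```

The values 7 and 15 correspond to the hex digits '7' and 'F', which is exactly what the lookup table in the example turns them into.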
To retrieve the data, operate on the mmap object as in the example code.
If you want to open a fragment of data in a hex editor, copy it into a temporary file, then open the file in the editor, e.g. with subprocess.check_call(), then copy the new file's contents back. (That's unless your editor has a command-line option that lets you set focus at a specific offset at startup.)
To use just Python's console, use something like
" ".join("%02x"%ord(c) for c in <data>)
to see the data in hex (or just repr to see it in ASCII), or, for more xxd-like look and feel, something 3rd-party like hexview.
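The one-liner above is written for Python 2, where iterating over a byte string yields one-character strings. A sketch of the Python 3 equivalent follows; there, iterating over bytes yields integers, so ord() is dropped. The helper name hex_view and the sample bytes are made up for the example.

```python
def hex_view(data):
    """Return the bytes of `data` as space-separated two-digit hex values."""
    return " ".join("%02x" % b for b in data)


sample = b"\x7fELF\x02\x01\x01\x00"
print(hex_view(sample))
```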
When reading a file (UTF-8 Unicode text, CSV) with Python on Linux, using either:
csv.reader()
file()
the values of some columns get a zero as their first character (there are no zeroes in the input), and others get a few zeroes, which are not seen when viewing the file with Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print data[51220][8]
>>> '0378983561' #should have been '478983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery, what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print data[51220][8]
>>> '0478983561'
If you want to use this as an integer, you should parse it.
print int(data[51220][8])
>>> 478983561
If you want this as a string, you should convert it back to a string.
print repr(int(data[51220][8]))
>>> '478983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you as in:
print int(data[51220][8])
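A minimal sketch of that behaviour in Python 3, assuming a made-up two-field sample line rather than the real file: csv.reader hands back every field as a string, leading zero included, and int() is what turns it into a number.

```python
import csv
import io

# Hypothetical two-column sample; the second field has a leading zero.
sample = io.StringIO("10016;0378983561\n")

row = next(csv.reader(sample, delimiter=';'))
number = int(row[1])

print(row)     # every field is a str, exactly as it appears in the input
print(number)  # explicit conversion drops the leading zero
```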
I am trying to write a bit array to a file in Python as in this example: python bitarray to and from file
however, I get garbage in my actual test file:
test_1 = ^@^@
test_2 = ^@^@
code:
from bitarray import bitarray
def test_function(myBitArray):
    test_bitarray=bitarray(10)
    test_bitarray.setall(0)
    with open('test_file.inp','w') as output_file:
        output_file.write('test_1 = ')
        myBitArray.tofile(output_file)
        output_file.write('\ntest_2 = ')
        test_bitarray.tofile(output_file)
Any help with what's going wrong would be appreciated.
That's not garbage. The tofile function writes binary data to a binary file. A 10-bit-long bitarray with all 0's will be output as two bytes of 0. (The docs explain that when the length is not a multiple of 8, it's padded with 0 bits.) When you read that as text, two 0 bytes will look like ^@^@, because ^@ is the way (many) programs represent a 0 byte as text.
If you want a human-readable, text-friendly representation, use the to01 method, which returns a human-readable string. For example:
with open('test_file.inp','w') as output_file:
    output_file.write('test_1 = ')
    output_file.write(myBitArray.to01())
    output_file.write('\ntest_2 = ')
    output_file.write(test_bitarray.to01())
Or maybe you want this instead:
output_file.write(str(test_bitarray))
… which will give you something like:
bitarray('0000000000')
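To see why ten bits end up as two bytes in the binary output, here is a pure-Python sketch of packing a bit string into whole bytes with zero padding. It only illustrates the padding rule mentioned earlier and is not bitarray's actual implementation; pack_bits is a hypothetical helper.

```python
def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes, zero-padding the last byte."""
    padded_len = (len(bits) + 7) // 8 * 8  # round up to a multiple of 8
    padded = bits.ljust(padded_len, '0')   # pad on the right with 0 bits
    return bytes(int(padded[i:i + 8], 2) for i in range(0, padded_len, 8))


packed = pack_bits('0' * 10)
print(packed)
```

Ten zero bits become two NUL bytes, which is what a text viewer then renders as control characters.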
In Python, when I try to read in an executable file with 'rb', instead of getting the binary values I expected (0010001 etc.), I'm getting a series of letters and symbols that I do not know what to do with.
Ex: ???}????l?S??????V?d?\?hG???8?O=(A).e??????B??$????????: ???Z?C'???|lP#.\P?!??9KRI??{F?AB???5!qtWI??8???!ᢉ?]?zъeF?̀z??/?n??
How would I access the binary numbers of a file in Python?
Any suggestions or help would be appreciated. Thank you in advance.
That is the binary. They are stored as bytes, and when you print them, they are interpreted as ASCII characters.
You can use the bin() function and the ord() function to see the actual binary codes.
for value in data:
    print bin(ord(value))
Byte sequences in Python are represented using strings. The series of letters and symbols that you see when you print out a byte sequence is merely a printable representation of bytes that the string contains. To make use of this data, you usually manipulate it in some way to obtain a more useful representation.
You can use ord(x) to obtain the integer value of a byte, and hex() or bin() on that value to obtain its hexadecimal and binary representations:
>>> f = open('/tmp/IMG_5982.JPG', 'rb')
>>> data = f.read(10)
>>> data
'\x00\x00II*\x00\x08\x00\x00\x00'
>>> data[2]
'I'
>>> ord(data[2])
73
>>> hex(ord(data[2]))
'0x49'
>>> bin(ord(data[2]))
'0b1001001'
>>> f.close()
The 'b' flag that you pass to open() does not tell Python anything about how to represent the file contents. From the docs:
Append 'b' to the mode to open the file in binary mode, on systems that differentiate between binary and text files; on systems that don’t have this distinction, adding the 'b' has no effect.
Unless you just want to look at what the binary data from the file looks like, Mark Pilgrim's book, Dive Into Python, has an example of working with binary file formats. The example shows how you can read ID3v1 tags from an MP3 file. The book's website seems to be down, so I'm linking to a mirror.
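For a flavour of what working with a binary format involves, here is a rough Python 3 sketch of reading an ID3v1 tag with struct. It is not the book's code. It relies on the ID3v1 layout: the last 128 bytes of the file, starting with b'TAG', followed by fixed-width title, artist, album, year and comment fields and a genre byte. The function name and the fabricated in-memory "file" are assumptions for the demonstration.

```python
import io
import struct

ID3V1_FORMAT = '3s30s30s30s4s30sB'  # 3 + 30 + 30 + 30 + 4 + 30 + 1 = 128 bytes


def read_id3v1(f):
    """Return (title, artist, album, year) from an ID3v1 tag, or None if absent."""
    f.seek(0, io.SEEK_END)
    if f.tell() < 128:
        return None
    f.seek(-128, io.SEEK_END)  # the tag occupies the last 128 bytes
    tag, title, artist, album, year, _comment, _genre = struct.unpack(
        ID3V1_FORMAT, f.read(128))
    if tag != b'TAG':
        return None

    def clean(raw):
        # Fields are fixed-width and padded with NUL bytes (sometimes spaces).
        return raw.split(b'\x00', 1)[0].decode('latin-1').strip()

    return clean(title), clean(artist), clean(album), clean(year)


# Fabricated data: some filler bytes followed by a hand-built tag.
fake_tag = struct.pack(ID3V1_FORMAT, b'TAG', b'Song', b'Band', b'Album', b'1999', b'', 0)
fake_file = io.BytesIO(b'\xff\xfb' * 10 + fake_tag)
info = read_id3v1(fake_file)
print(info)
```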
Each character in the string is the ASCII representation of a binary byte. If you want it as a string of zeros and ones then you can convert each byte to an integer, format it as 8 binary digits and join everything together:
>>> s = "hello world"
>>> ''.join("{0:08b}".format(ord(x)) for x in s)
'0110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'
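The example above is Python 2. In Python 3, data read from a file opened with 'rb' is a bytes object whose elements are already integers, so a sketch of the same idea needs no ord(). The sample value b"hi" simply stands in for real file contents.

```python
data = b"hi"  # stands in for the result of open(path, 'rb').read()

# Each element of a bytes object is an int, formatted here as 8 binary digits.
bits = ''.join('{0:08b}'.format(b) for b in data)
print(bits)
```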
Depending on whether you really need to analyse or manipulate things at the binary level, an external module such as bitstring could be helpful. Check out the docs; to just get the binary interpretation, use something like:
>>> f = open('somefile', 'rb')
>>> b = bitstring.Bits(f)
>>> b.bin
0100100101001001...
Use ord(x) to get the integer value of each byte.
>>> with open('settings.dat', 'rb') as file:
...     data = file.read()
...
>>> for index, value in enumerate(data):
...     print '0x%08x 0x%02x' % (index, ord(value))
...
0x00000000 0x28
0x00000001 0x64
0x00000002 0x70
0x00000003 0x30
0x00000004 0x0d
0x00000005 0x0a
0x00000006 0x53
0x00000007 0x27
0x00000008 0x4d
0x00000009 0x41
0x0000000a 0x49
0x0000000b 0x4e
0x0000000c 0x5f
0x0000000d 0x57
0x0000000e 0x49
0x0000000f 0x4e
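If you really want to convert the binary bytes to a stream of bits, you have to remove the first two chars ('0b') from the output of bin(). Reversing the result and padding it on the right, as below, emits each byte least-significant bit first; for most-significant-bit-first order, skip the reversal and use .zfill(8) instead:
with open("settings.dat", "rb") as fp:
    print "".join( (bin(ord(c))[2:][::-1]).ljust(8,"0") for c in fp.read() )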
If you use Python prior to 2.6, you have no bin() function.
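Without bin(), the bits can be extracted by hand with a shift and a mask. The following is a small sketch written in Python 3 syntax; to_bits and its msb_first parameter are names invented for the example, and the function takes an integer, so in Python 2 you would pass ord(c).

```python
def to_bits(value, msb_first=True):
    """Return the 8-character bit string for an integer in the range 0..255."""
    order = range(7, -1, -1) if msb_first else range(8)
    # (value >> i) & 1 isolates bit number i of the value.
    return ''.join(str((value >> i) & 1) for i in order)


print(to_bits(0x28))                   # most significant bit first
print(to_bits(0x28, msb_first=False))  # least significant bit first
```

The second form matches the bit order produced by the reversed bin() output above.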