I have a binary file with some character data stuck in the middle of a bunch of ints and floats. I am trying to read it with numpy. The farthest I have been able to get with the character data is:
strbits = np.fromfile(infile,dtype='int8',count=73)
(Yes, it's a 73-character string.)
Three questions: Is my data now stored without corruption or truncation in strbits? And, can I now convert strbits into a readable string? Finally, should I be doing this in some completely different way?
UPDATE:
Here's something that works, but I would think there would be a more elegant way.
strarr = np.zeros(73, dtype='c')
for n in range(73):
    strarr[n] = np.fromfile(infile, dtype='c', count=1)[0]
So now I have an array where each element is a single character from the input file.
The way you're doing it is fine. Here's how you can convert it to a string.
strbits = np.fromfile(infile, dtype=np.int8, count=73)
a_string = ''.join([chr(item) for item in strbits])
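If the 73 bytes are plain ASCII, a slightly more direct route is to decode the raw buffer (just a sketch, reusing strbits from above):
a_string = strbits.tobytes().decode('ascii')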
Related
I have a long array of items (4700) that will ultimately be 1 or 0 when compared to settings in another list. I want to be able to construct a single integer/string item that I can store in some of the metadata such that it can be accessed later in order to uniquely identify the combination of items that goes into it.
I am writing this all in Python. I am thinking of doing something like zlib compression plus a hex conversion, but I am getting myself confused about how to do the inverse transformation. So assuming bin_string is the string of 1's and 0's, it should look something like this:
import zlib
#example bin_string, real one is much longer
bin_string="1001010010100101010010100101010010101010000010100101010"
compressed = zlib.compress(bin_string.encode())
this_hex = compressed.hex()
where I can then save this_hex to the metadata. The question is, how do I get the original bin_string back from my hex value? I have lots of Python experience with numerical methods and such but little with compression, so any basic insights would be very valuable.
Just do the inverse of each operation. This:
zlib.decompress(bytearray.fromhex(this_hex)).decode()
will return your original string.
It would be faster and might even result in better compression to simply encode your bits as bits in a byte string, along with a terminating one bit followed by zeros to pad out the last byte. That would be seven bytes instead of the 22 you're getting from zlib.compress(). zlib would do better only if there is a strong bias for 0's or 1's, and/or there are repeating patterns in the 0's and 1's.
As for encoding for the metadata, Base64 would be more compact than hexadecimal. Your example would be lKVKVKoKVQ==.
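A minimal sketch of that packing idea (the helper names pack_bits/unpack_bits are just illustrative; it appends the terminating one bit, pads the last byte with zeros, and uses Base64 for the metadata text):

import base64

def pack_bits(bits):
    padded = bits + '1'                      # terminating one bit
    padded += '0' * (-len(padded) % 8)       # pad out the last byte
    return bytes(int(padded[i:i+8], 2) for i in range(0, len(padded), 8))

def unpack_bits(data):
    bits = ''.join(format(b, '08b') for b in data)
    return bits[:bits.rindex('1')]           # strip the padding and terminator

bin_string = "1001010010100101010010100101010010101010000010100101010"
encoded = base64.b64encode(pack_bits(bin_string)).decode()   # 'lKVKVKoKVQ=='
assert unpack_bits(base64.b64decode(encoded)) == bin_string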
You should try using the .savez_compressed() method of numpy.
Convert your simple array into a numpy array and then use this:
numpy.savez_compressed("filename.npz", your_array)
Use numpy.load() to load the .npz file.
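A minimal round trip might look like this (a sketch reusing bin_string from the question; the key name bits is just illustrative):

import numpy as np

bin_string = "1001010010100101010010100101010010101010000010100101010"
arr = np.array([int(c) for c in bin_string], dtype=np.uint8)
np.savez_compressed("filename.npz", bits=arr)

loaded = np.load("filename.npz")["bits"]
assert ''.join(str(b) for b in loaded) == bin_string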
I am trying to read a bunch of hex numbers from a file.
lines ='4005297103CE40C040059B532A7472C440061509BB9597D7400696DBCF1E35CC4007206BB5B0A67B4007AF4B08111B87400840D4766460524008D47E0FFB4ABA400969A572EBAFE7400A0107CCFDF50E'
dummy = [lines[i:i+16] for i in range(0, len(lines), 16)]
rdummy=[]
for elem in dummy[:-1]:
    rdummy.append(int(elem, 16))
These are 10 numbers of 16 hex digits each.
In particular, when reading the first one, I have:
print(dummy[0])
4005297103CE40C0
Now I would like to convert it to a float.
I have an IDL script that gives 2.64523509 when reading this number.
The command used in IDL is
double(4613138958682833088, 0)
where it appears 0 is an offset used when converting.
Is there a way to do this in Python?
You probably want to use the struct module for this; something like this seems to work:
import struct
lines ='4005297103CE40C040059B532A7472C440061509BB9597D7400696DBCF1E35CC4007206BB5B0A67B4007AF4B08111B87400840D4766460524008D47E0FFB4ABA400969A572EBAFE7400A0107CCFDF50E'
for [value] in struct.iter_unpack('>d', bytes.fromhex(lines)):
    print(value)
This results in 2.64523509 being printed first, which seems about right.
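If you only want one of the 16-digit chunks rather than the whole line, the same format string works with a single unpack; for example, with the first chunk from the question:

import struct

value = struct.unpack('>d', bytes.fromhex('4005297103CE40C0'))[0]
print(value)  # 2.64523509...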
So I had a question about a serialization algorithm I just came up with; I wanted to know if it already exists and if there's a better version out there.
We know typical algorithms join the words in a list with a delimiter, but then you have to scan every word for occurrences of the delimiter and escape them, or the serialization isn't robust. I thought a more intuitive approach would be to use a higher-level language like Python, where len() is O(1), and prepend each word's length to it. For example, the code attached below.
Wouldn't this be faster because instead of going through every letter of every word we instead just go through every word? And then deserialization we don't have to look through every character to find the delimiter, we can just skip directly to the end of each word.
The only problem I see is that double digit sizes would cause problems, but I'm sure there's a way around that I haven't found yet.
It was suggested to me that protocol buffers are similar to this idea, but I haven't understood why yet.
def serialize(list_o_words):
    return ''.join(str(len(word)) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index])
        next_index = index + word_length + 1
        deserialized_list.append(serialized_list_o_words[index+1:next_index])
        index = next_index
    return deserialized_list

serialized_list = "some,comma,separated,text".split(",")
print(serialize(serialized_list))
print(deserialize(serialize(serialized_list)) == serialized_list)
Essentially, I want to know how I can handle double digit lengths.
There are many variations on length-prefixed strings, but the key bits come down to how you store the length.
You're deserializing the lengths as a single-character ASCII number, which means you can only handle lengths from 0 to 9. (You don't actually test that on the serialize side, so you can generate garbage, but let's forget that.)
So, the obvious option is to use 2 characters instead of 1. Let's add in a bit of error handling while we're at it; the code is still pretty easy:
def _len(word):
    s = format(len(word), '02')
    if len(s) != 2:
        raise ValueError(f"Can't serialize {s}; it's too long")
    return s

def serialize(list_o_words):
    return ''.join(_len(word) + word for word in list_o_words)

def deserialize(serialized_list_o_words):
    index = 0
    deserialized_list = []
    while index+1 < len(serialized_list_o_words):
        word_length = int(serialized_list_o_words[index:index+2])
        next_index = index + word_length + 2
        deserialized_list.append(serialized_list_o_words[index+2:next_index])
        index = next_index
    return deserialized_list
But now you can't handle strings >99 characters.
Of course you can keep adding more digits for longer strings, but if you think "I'm never going to need a 100,000-character string"… you are going to need it, and then you'll have a zillion old files in the 5-digit format that aren't compatible with the new 6-digit format.
Also, this wastes a lot of bytes. If you're using 5-digit lengths, s encodes as 00001s, which is 6x as big as the original value.
You can stretch things a lot farther by using binary lengths instead of ASCII. Now, with two bytes, we can handle lengths up to 65535 instead of just 99. And if you go to four or eight bytes, that might actually be big enough for all your strings ever. Of course this only works if you're storing bytes rather than Unicode strings, but that's fine; you probably needed to encode your strings for persistence anyway. So:
import struct

def _len(word):
    # struct.pack already raises an exception for lengths > 65535
    return struct.pack('>H', len(word))

def serialize(list_o_words):
    utfs8 = (word.encode() for word in list_o_words)
    return b''.join(_len(utf8) + utf8 for utf8 in utfs8)
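For completeness, a matching deserializer might look something like this (a sketch that assumes the 2-byte big-endian lengths and UTF-8 encoding used above):

def deserialize(data):
    index = 0
    words = []
    while index < len(data):
        (word_length,) = struct.unpack_from('>H', data, index)
        index += 2
        words.append(data[index:index + word_length].decode())
        index += word_length
    return words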
Of course this isn't very human-readable or -editable; you need to be comfortable in a hex editor to replace a string in a file this way.
Another option is to delimit the lengths. This may sound like a step backward—but it still gives us all the benefits of knowing the length in advance. Sure, you have to "read until comma", but you don't have to worry about escaped or quoted commas the way you do with CSV files, and if you're worried about performance, it's going to be much faster to read a buffer of 8K at a time and chunk through it with some kind of C loop (whether that's slicing, or str.find, barely matters by comparison) than to actually read either until comma or just two bytes.
This also has the benefit of solving the sync problem. With delimited values, if you come in mid-stream, or get out of sync because of an error, it's no big deal; just read until the next unescaped delimiter and worst-case you missed a few values. With length-prefixed values, if you're out of sync, you're reading arbitrary characters and treating them as a length, which just throws you even more out of sync. The netstring format is a minor variation on this idea, with a tiny bit more redundancy to make sync problems easier to detect/recover from.
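A rough sketch of that delimited-length idea (close to, but not exactly, the netstring format: an ASCII length, a ':' delimiter, the bytes, and a trailing ','):

def serialize(words):
    return b''.join(b'%d:%s,' % (len(w), w) for w in (word.encode() for word in words))

def deserialize(data):
    index = 0
    words = []
    while index < len(data):
        colon = data.index(b':', index)
        length = int(data[index:colon])
        words.append(data[colon + 1:colon + 1 + length].decode())
        index = colon + 1 + length + 1   # skip the value and the trailing ','
    return words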
Going back to binary lengths, there are all kinds of clever tricks for encoding variable-length numbers. Here's one idea, in pseudocode:
if the current byte is < hex 0x80 (128):
    that's the length
else:
    add the low 7 bits of the current byte
    plus 128 times (recursively process the next byte)
Now you can handle short strings with just 1 byte of length, but if a 5-billion-character string comes along, you can handle that too.
Of course this is even less human-readable than fixed binary lengths.
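Translated into Python, that variable-length scheme might look roughly like this (encode_length/decode_length are just illustrative names):

def encode_length(n):
    # low 7 bits per byte, least-significant group first;
    # the high bit marks "another byte follows"
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_length(data, index=0):
    n, shift = 0, 0
    while True:
        byte = data[index]
        index += 1
        n += (byte & 0x7F) << shift
        if byte < 0x80:
            return n, index
        shift += 7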
And finally, if you ever want to be able to store other kinds of values, not just strings, you probably want a format that uses a "type code". For example, use I for 32-bit int, f for 64-bit float, D for datetime.datetime, etc. Then you can use s for strings <256 characters with a 1-byte length, S for strings <65536 characters with a 2-byte length, z for string <4B characters with a 4-byte length, and Z for unlimited strings with a complicated variable-int length (or maybe null-terminated strings, or maybe an 8-byte length is close enough to unlimited—after all, nobody's ever going to want more than 640KB in a computer…).
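As a very rough sketch of the type-code idea (the codes and widths here are just the hypothetical ones from the paragraph above; the 'z'/'Z' cases for very long strings are omitted):

import struct

def serialize_value(value):
    if isinstance(value, int):
        return b'I' + struct.pack('>i', value)             # 32-bit int
    if isinstance(value, float):
        return b'f' + struct.pack('>d', value)             # 64-bit float
    utf8 = value.encode()
    if len(utf8) < 256:
        return b's' + struct.pack('>B', len(utf8)) + utf8  # short string, 1-byte length
    return b'S' + struct.pack('>H', len(utf8)) + utf8      # longer string, 2-byte length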
I have a program in Python which analyses file headers and decides which file type it is. (https://github.com/LeoGSA/Browser-Cache-Grabber)
The problem is the following:
I read the first 24 bytes of a file:
with open(from_folder+"/"+i, "rb") as myfile:
    header = str(myfile.read(24))
then I look for a pattern in it:
if y[1] in header:
    shutil.move(from_folder+"/"+i, to_folder+y[2]+i+y[3])
where y = ['/video', r'\x47\x40\x00', '/video/', '.ts']
y[1] is the pattern and = r'\x47\x40\x00'
The file does contain these bytes, but the program does NOT find this pattern (r'\x47\x40\x00') in the file header.
So I tried to print header, and Python shows the first two bytes as 'G@' instead of '\x47\x40'.
And if I search for 'G@'+r'\x00' in header, everything is OK: it finds it.
Question: What am I doing wrong? I want to look for r'\x47\x40\x00' and find it, not some strange 'G@'+r'\x00'.
OR
Why does Python show the first two bytes as 'G@' and not as '\x47\x40', while it shows the rest of the header in hex? Is there a way to fix it?
UPDATE: this is what I ended up doing instead:
import binascii

with open(from_folder+"/"+i, "rb") as myfile:
    header = myfile.read(24)
header = str(binascii.hexlify(header))[2:-1]
The result I get is:
4740001b0000b00d0001c100000001efff3690e23dffffff
and I can work with it.
P.S. But anyway, if anybody can explain what the problem was with the first 2 bytes, I would be grateful.
In Python 3 you'll get bytes from a binary read, rather than a string.
There's no need to convert it to a string with str.
print will try to convert bytes to something human readable.
If you don't want that, convert your bytes to e.g. hex representations of the integer values of the bytes by:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (''.join ([hex (aByte) for aByte in aBytes]))
Output as redirected from the console:
b'\x00G@\x00\x13\x00\x00\xb0'
0x00x470x400x00x130x00x00xb0
You can't search in aBytes with a str pattern like r'\x47\x40\x00' using the in operator, since aBytes isn't a string but a bytes object; a bytes pattern such as b'\x47\x40\x00' would work directly.
If you want to apply a string search for '\x00\x47\x40', use:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (r'\x'.join ([''] + ['%0.2x'%aByte for aByte in aBytes]))
Which will give you:
b'\x00G@\x00\x13\x00\x00\xb0'
\x00\x47\x40\x00\x13\x00\x00\xb0
So there are a number of separate issues at play here:
print tries to print something human readable, which succeeds only for the first two chars.
You can't mix str patterns and bytes with in; either search with a bytes pattern directly, or convert the bytes to a string of fixed-length hex representations and search in that, as shown.
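As a minimal sketch of the direct route, assuming the header is kept as bytes (i.e. read without wrapping it in str()), the in operator can then be used with a bytes pattern:

with open(from_folder + "/" + i, "rb") as myfile:
    header = myfile.read(24)             # bytes, not str
if b'\x47\x40\x00' in header:            # bytes pattern searched in bytes
    shutil.move(from_folder + "/" + i, to_folder + y[2] + i + y[3])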
I want to read a binary file using Python. So far I've used numpy.fromfile but have not been able to figure out the structure of the resultant array. I have an IDL function that would read the file, so this is the only thing I have to go on. I have no knowledge of IDL at all.
The following IDL function will read the file and return lc,zgrid,fnu,efnu etc.:
openr,lun,'file.dat',/swap_if_big_endian,/get_lun
s = lonarr(4) & readu,lun,s
NFILT=s[0] & NTEMP = s[1] & NZ = s[2] & NOBJ = s[3]
tempfilt = dblarr(NFILT,NTEMP,NZ)
lc = dblarr(NFILT) ; central wavelengths
zgrid = dblarr(NZ)
fnu = dblarr(NFILT,NOBJ)
efnu = dblarr(NFILT,NOBJ)
readu,lun,tempfilt,lc,zgrid,fnu,efnu
close,/all
But I am unsure how to replicate this in Python. Any help is appreciated. Thanks.
I'm not looking for translation. I'm looking for a springboard from which I can try and solve this problem.
To read a binary file (assuming it contains 32-bit values, or something the user already knows), I would first make a method that uses:
>>> a = '00011111001101110000101010101010'
>>> int(a,2)
523700906
That is, our method has to wrap this conversion into something we make ourselves, such as:
def binaryToAscii(string_of_binary):
    '''
    binaryToAscii takes in a string of binary and returns an ASCII character
    '''
    charVal = int(string_of_binary, 2)
    char = chr(charVal)
    return char
The next step would be to make a method that incorporates binaryToAscii in such a way that we are either concatenating some string, or writing to a new file. This should be left to the user to decide.
As an aside, if you are not retrieving the binary as a string, there are built-in methods that turn Unicode characters into ASCII values by taking in their Unicode value (binary included).
Regarding reading the file itself, the standard approach for reading and writing files can be used.
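For the specific layout in the IDL snippet above, a rough numpy sketch might look like the following (it assumes little-endian data, which is what /swap_if_big_endian implies, and transposes to match IDL's column-major dblarr shapes; adjust the dtypes if the file differs):

import numpy as np

with open('file.dat', 'rb') as f:
    # s = lonarr(4): NFILT, NTEMP, NZ, NOBJ as 32-bit integers
    nfilt, ntemp, nz, nobj = np.fromfile(f, dtype='<i4', count=4)
    # dblarr(...) -> 64-bit floats; read flat, reshape, transpose
    tempfilt = np.fromfile(f, dtype='<f8', count=nfilt*ntemp*nz).reshape(nz, ntemp, nfilt).T
    lc = np.fromfile(f, dtype='<f8', count=nfilt)       # central wavelengths
    zgrid = np.fromfile(f, dtype='<f8', count=nz)
    fnu = np.fromfile(f, dtype='<f8', count=nfilt*nobj).reshape(nobj, nfilt).T
    efnu = np.fromfile(f, dtype='<f8', count=nfilt*nobj).reshape(nobj, nfilt).T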