How do I convert binary file data into numbers? - python

I have a raw data file that represents voltages recorded by a device. I want to convert the binary file into a file of numbers that can be plotted.
The raw data is little endian and each sample is 3 bytes (24 bits). When I view the file with a text editor, I see strange, unreadable characters, as shown below:
</DataInfo>
ò ì ê ì ð ô ù þ ý ù ø ÷ õ ò ï î î ï ò ô ò ï î î ï í é ç ë ò ø ú ü þ
Which makes sense because the data is still binary. So I used command line to produce a hexadecimal file that looks like:
000007F0 2F 46 50 3E 0A 3C 2F 44 61 74 61 49 6E 66 6F 3E /FP>.</DataInfo>
00000800 0D 0A 0D 0A F2 08 00 EC 08 00 EA 08 00 EC 08 00 ....ò..ì..ê..ì..
00000810 F0 08 00 F4 08 00 F9 08 00 FE 08 00 00 09 00 FD ð..ô..ù..þ.....ý
00000820 08 00 F9 08 00 F8 08 00 F7 08 00 F5 08 00 F2 08 ..ù..ø..÷..õ..ò.
My issue is that when I convert the hexadecimal to a decimal number, the number is far too large to be correct, and I can't figure out what went wrong.
I don't know much about binary files and am not even sure where to look, so any guidance is greatly appreciated!
FYI
I have some python programming skills and the file I am working with can be seen here: https://drive.google.com/file/d/1WZ6OBPLKIqrxw1GsvG776jqD8U08CMD9/view?usp=sharing

It looks like there is a header which ends with 0D 0A 0D 0A.
So the data starts at byte 0x804.
You can use the trick from the question "NumPy: 3-byte, 6-byte types (aka uint24, uint48)" to parse it:
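import numpy as np

a = "F2 08 00 EC 08 00 EA 08 00 EC 08 00".replace(" ", "")
a = bytes(bytearray.fromhex(a))
a = np.frombuffer(a, dtype='<u1')
# One zero-initialised 4-byte little-endian slot per 3-byte sample
e = np.zeros(a.size // 3, np.dtype('<u4'))
# Copy byte i of every sample into byte i of its slot; the top byte stays 0
for i in range(3):
    e.view(dtype='<u1')[i::4] = a.view(dtype='<u1')[i::3]
print(e)
# Outputs [2290 2284 2282 2284]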
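The same idea can be sketched without NumPy, working on the file contents directly. This is only a sketch under assumptions: the helper names `split_payload` and `decode_int24_le` are hypothetical, the header is assumed to end at the `</DataInfo>` tag followed by `0D 0A 0D 0A` as in the hex dump, and whether the samples are signed is not stated in the question, so it is left as a parameter.

```python
# Header terminator as seen in the hex dump: "</DataInfo>" then CR LF CR LF.
HEADER_END = b"</DataInfo>\r\n\r\n"


def split_payload(blob, marker=HEADER_END):
    """Return the bytes that follow the first occurrence of marker."""
    pos = blob.find(marker)
    if pos < 0:
        raise ValueError("header terminator not found")
    return blob[pos + len(marker):]


def decode_int24_le(raw, signed=False):
    """Decode consecutive 3-byte little-endian integers; a trailing partial sample is ignored."""
    usable = len(raw) - len(raw) % 3
    return [int.from_bytes(raw[i:i + 3], "little", signed=signed)
            for i in range(0, usable, 3)]


# Stand-in for the real file contents: a dummy header plus the first four samples.
demo = b"<DataInfo>...</DataInfo>\r\n\r\n" + bytes.fromhex("F2 08 00 EC 08 00 EA 08 00 EC 08 00")
samples = decode_int24_le(split_payload(demo))
print(samples)  # [2290, 2284, 2282, 2284]
```

For the real file, `demo` would be replaced by the result of reading the file in binary mode (`open(path, "rb").read()`). If the device stores two's-complement values, passing `signed=True` may be the right choice.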

Related

How do I match multiline expressions with junk in the middle?

I'm trying to match a multiline expression from some logs we have. The biggest problem is that, due to race conditions, we sometimes have to use a custom print function with a mutex, and sometimes (when that's not necessary) we just use printf. This results in two types of logs.
My solution was this monstrosity:
changed key '(\w+)' value: <((([0-9a-f]{2} *)+)(?:\n)*(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)*)*>
Explanation of the above regex:
changed key '(\w+)' value: - This is how we detect a print (and save the keyname in a capture group).
<{regex}> - The value output starts with < and ends with >
([0-9a-f]{2} *) - The bytes are hexadecimal pairs followed by an optional space (because last byte doesn't have a space). Let's call this capture group 4.
({group4}+) - One or more of group 4.
(?:\n)* - There can be 0 or more newlines after this "XX " pair. (non-capture)
(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)* - There can be 0 or more prints of the timestamp. (non-capture)
This works for the Case 2 logs, but not for the Case 1 logs. In Case 1, for some reason only the last line is matched.
Essentially, I'm trying to match this (two capture groups):
changed key '(\w+)' value: <({only hexadecimal pairs})>
group 1: key
group 2: value
Below are the dummy cases (same value in all cases):
// Case 1
<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>
// Case 2
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>
// Case 2 with some newlines in the middle
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00
04 00 00 ff
ff 00 00 00 11 00
00 00 00 00 21>
The key isn't always the same key, so the value (and the value length) can change.
This approach starts by first stripping out the leading log content of each line, leaving behind the content you want to target. After that, it does an re.findall search using a regex pattern similar to the one you are already using.
import re
inp = """<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>"""
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], re.sub(r'\s+', ' ', x[1])) for x in matches]
print(matches)
This prints:
[('KEY_NAME', 'ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21')]
Assuming there could be unwanted values in between 'KEY_NAME' value: < and the closing >, we can use re.findall on the second group to match all hexadecimal values:
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], ' '.join(re.findall(r'\b[a-f0-9]{2}\b', x[1]))) for x in matches]
print(matches) # output same as above

How to extract IPv6 field attribute values from the header present in hexadecimal string value in a file using Python?

An IPv6 header has the following value :
68 01 00 00 31 02 FF 2A 01 3F 4D 9C 7E 11 14 56 19 DE A0 BD CD 17 FF CD DF 01 03 04 BC 2B 3A 4E 9D AB DE 9D AE 07 FF (IN TXT FILE)
After removing white spaces:
680100003102FF2A013F4D9C7E11145619DEA0BDCD17FFCDDF010304BC2B3A4E9DABDE9DAE07FF
The above header is present in a file and I'm trying to extract all the field attributes present in the string as mentioned in the picture.
Here the picture contains header field attribute details:
I tried slicing the string and checking whether each value falls below the upper limit for its byte size, but this doesn't work when letters (the hex digits A-F) appear in the value.
Is there any optimal and error-free way to do this generically in python?
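One possible approach is to convert the hex text to bytes first and then slice fixed-width fields, so hex letters need no special handling. The sketch below is not taken from the question: the picture with the field details is not reproduced here, so it assumes the standard 40-byte IPv6 fixed header layout (version, traffic class, flow label, payload length, next header, hop limit, source, destination). The function name `parse_ipv6_header` and the demo header are hypothetical. Note that the sample string shown above is 78 hex digits, i.e. 39 bytes, one byte short of a full header, so this function would reject it as written.

```python
import ipaddress
import struct


def parse_ipv6_header(hex_text):
    """Parse a 40-byte IPv6 fixed header given as hex text (spaces allowed)."""
    raw = bytes.fromhex(hex_text)  # fromhex skips the spaces between byte pairs
    if len(raw) < 40:
        raise ValueError("need 40 bytes, got %d" % len(raw))
    # First 8 bytes, network byte order: one 32-bit word, one 16-bit, two 8-bit fields.
    first_word, payload_length, next_header, hop_limit = struct.unpack("!IHBB", raw[:8])
    return {
        "version": first_word >> 28,                  # top 4 bits
        "traffic_class": (first_word >> 20) & 0xFF,   # next 8 bits
        "flow_label": first_word & 0xFFFFF,           # low 20 bits
        "payload_length": payload_length,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "source": str(ipaddress.IPv6Address(raw[8:24])),
        "destination": str(ipaddress.IPv6Address(raw[24:40])),
    }


# Made-up header for illustration only (not the data from the question).
demo_header = (
    "60 00 00 00 00 14 06 40 "
    "20 01 0d b8 00 00 00 00 00 00 00 00 00 00 00 01 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01"
)
fields = parse_ipv6_header(demo_header)
print(fields["version"], fields["payload_length"], fields["source"])
```

Working on bytes and integers instead of comparing substrings avoids the problem with letters entirely, since `bytes.fromhex` treats `A`-`F` like any other hex digit.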

weird behavior when searching with b'\x5e', but with b'\x4e' is ok

import re
def main():
    with open('test', 'wb') as f:
        f.write(b'\x1e\x2e\x3e\x4e\x5e\x6e')
        f.write(b'\x1e\x2e\x3e\x4e\x5e\x6e')
    with open('test', 'rb') as f:
        s = f.read()
    for i in range((len(s)//8)+1):
        print(' '.join(['{:02x}'.format(j) for j in s[i*8:(i+1)*8]]))
    regex = re.compile(b'\x5e') # weird
    for match_obj in regex.finditer(s):
        start = match_obj.start()
        end = match_obj.end()
        print(start, end)
if __name__=='__main__':
    main()
After I executed the code with the pattern b'\x5e', I got
1e 2e 3e 4e 5e 6e 1e 2e
3e 4e 5e 6e
0 0
If I change the pattern to b'\x4e' and run it again, I get
1e 2e 3e 4e 5e 6e 1e 2e
3e 4e 5e 6e
3 4
9 10
Why do the two patterns behave differently, and how do I fix it?
Thanks.
0x5e is "^" in ASCII, which is a regex metacharacter. You will need to escape it if you want to use it in a pattern literally.
>>> re.escape(b'\x5e')
b'\\^'
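As a small sketch of the fix, the escaped pattern can be compiled and used with finditer on the same twelve bytes the question writes to the file. The variable names here are only illustrative.

```python
import re

data = b'\x1e\x2e\x3e\x4e\x5e\x6e' * 2  # same bytes as in the question's file

# Unescaped, b'\x5e' is the anchor "^": it matches once, with zero width, at position 0.
unescaped = [m.span() for m in re.finditer(b'\x5e', data)]

# Escaped, it matches the literal byte 0x5e.
escaped = [m.span() for m in re.finditer(re.escape(b'\x5e'), data)]

print(unescaped)  # [(0, 0)]
print(escaped)    # [(4, 5), (10, 11)]
```

Other bytes in this data are regex metacharacters too (0x2e is "."), so escaping any byte string that is meant literally is the safer habit. For a plain byte search, `data.find(b'\x5e')` avoids regular expressions altogether.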

How to print binary file as bytes?

I did
>>> b0 = open('file','rb')
Then
>>> b0.read(10)
gives
b'\xb8\xaaK\x1e^J)\xab_I'
How can I get things printed all as pure hex bytes? I want
b'\xb8\xaa\x4b\x1e\x5e\x4a\x29\xab\x5f\x49'
(PS: is it possible to print it pretty? like
B8 AA 4B 1E 5E 4A 29 AB 5F 49
or colon separated.)
>>> s = b'\xb8\xaaK\x1e^J)\xab_I'
>>> ' '.join('{:02X}'.format(c) for c in s)
'B8 AA 4B 1E 5E 4A 29 AB 5F 49'
or, slightly more concisely:
>>> ' '.join(map('{:02X}'.format, s))
'B8 AA 4B 1E 5E 4A 29 AB 5F 49'
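On Python 3.8 or later, `bytes.hex` accepts a separator, which covers the space-separated and colon-separated forms. The `\x..` form asked for first can be built with a format string. A short sketch, with illustrative variable names:

```python
s = b'\xb8\xaaK\x1e^J)\xab_I'

spaced = s.hex(' ').upper()   # separator argument requires Python 3.8+
colons = s.hex(':')
escaped = ''.join('\\x{:02x}'.format(c) for c in s)

print(spaced)   # B8 AA 4B 1E 5E 4A 29 AB 5F 49
print(colons)   # b8:aa:4b:1e:5e:4a:29:ab:5f:49
print(escaped)  # \xb8\xaa\x4b\x1e\x5e\x4a\x29\xab\x5f\x49
```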

printing number of snmpwalk results

We're trying to make a script on an Ubuntu server that reads the number of results from an snmpwalk command and then sends it to Cacti for graphing.
None of us has any programming knowledge, and nothing we have tried so far has worked.
It will go like this:
the script runs: snmpwalk -v 1 -c public -Cp 10.59.193.141 .1.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1
The command will print
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.34.250.121.174.124 = Hex-STRING: 00 22 FA 79 AE 7C
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.35.20.11.246.64 = Hex-STRING: 00 23 14 0B F6 40
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.38.198.89.34.192 = Hex-STRING: 00 26 C6 59 22 C0
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.40.224.44.221.222.148 = Hex-STRING: 28 E0 2C DD DE 94
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.100.163.203.10.120.83 = Hex-STRING: 64 A3 CB 0A 78 53
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.120.214.240.8.133.165 = Hex-STRING: 78 D6 F0 08 85 A5
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.132.0.210.179.213.93 = Hex-STRING: 84 00 D2 B3 D5 5D
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.132.0.210.201.8.196 = Hex-STRING: 84 00 D2 C9 08 C4
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.140.112.90.108.236.188 = Hex-STRING: 8C 70 5A 6C EC BC
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.140.112.90.139.18.244 = Hex-STRING: 8C 70 5A 8B 12 F4
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.180.240.171.112.37.69 = Hex-STRING: B4 F0 AB 70 25 45
Variables found: 11
Then the script should somehow do: read until Variables found: and read "11", and then print "11".
So basically we want the script to filter out the number "11" in this case which we can use in Cacti for graphing. We've tried some scripts on google and looked around for information, but found nothing.
I think it should be easy if you know how to do it, but we are beginners at programming.
Thanks in advance!
Using Perl, add the following command after a pipe to extract the number you want:
... | perl -ne 'm/\A(?i)variables\s+/ and m/(\d+)\s*$/ and printf qq|%s\n|, $1 and exit'
It will print:
11
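The same filtering can be sketched in Python, as an alternative to the Perl one-liner. The function name `variables_found` is hypothetical, and the sample text is a shortened copy of the output shown in the question.

```python
import re


def variables_found(output):
    """Return the number after 'Variables found:' as an int, or None if absent."""
    match = re.search(r"^Variables found:\s*(\d+)\s*$", output, flags=re.M | re.I)
    return int(match.group(1)) if match else None


# Shortened sample of the snmpwalk output from the question.
sample = (
    "iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.180.240.171.112.37.69 = Hex-STRING: B4 F0 AB 70 25 45\n"
    "Variables found: 11\n"
)
count = variables_found(sample)
print(count)  # 11
```

In a real script the text would come from the snmpwalk command rather than a literal string, for example by piping its output into the script and passing `sys.stdin.read()` to the function.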
