How do I match multiline expressions with junk in the middle?

How do I match multiline expressions with junk in the middle? - python

I'm trying to match a multiline expression from some logs we have. The biggest problem is due to race-conditions, we sometimes have to use a custom print function with a mutex, and sometimes (when that's not necessary) we just use printf. This results in two types of logs.
My solution was this monstrosity:
changed key '(\w+)' value: <((([0-9a-f]{2} *)+)(?:\n)*(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)*)*>
Explanation of the above regex:
changed key '(\w+)' value: - This is how we detect a print (and save the keyname in a capture group).
<{regex}> - The value output starts with < and ends with >
([0-9a-f]{2} *) - The bytes are hexadecimal pairs followed by an optional space (because last byte doesn't have a space). Let's call this capture group 4.
({group4}+) - One or more of group 4.
(?:\n)* - There can be 0 or more newlines after this "XX " pair. (non-capture)
(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)* - There can be 0 or more prints of the timestamp. (non-capture)
This works for the Case 2 logs, but not for the Case 1 logs. In Case 1, for some reason only the last line is matched.
Essentially, I'm trying to match this (two capture groups):
changed key '(\w+)' value: <({only hexadecimal pairs})>
group 1: key
group 2: value
Below is the dummy cases (same value in all cases):
// Case 1
<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>
// Case 2
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>
// Case 2 with some newlines in the middle
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00
04 00 00 ff
ff 00 00 00 11 00
00 00 00 00 21>
The key isn't always the same key, so the value (and the value length) can change.

This approach starts by first stripping out the leading log content of each line, leaving behind the content you want to target. After that, it does an re.findall search using a regex pattern similar to the one you are already using.
inp = """<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>"""
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], re.sub(r'\s+', ' ', x[1])) for x in matches]
print(matches)
This prints:
[('KEY_NAME', 'ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21')]
Assuming there could be unwanted values in between 'KEY_NAME' value: < and the closing >, we can use re.findall on the second group to match all hexadecimal values:
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], ' '.join(re.findall(r'\b[a-f0-9]{2}\b', x[1]))) for x in matches]
print(matches) # output same as above

Related

How do I convert binary file data into numbers?

I have a raw data file, that represent voltages recorded by a device. I want to convert the binary file into a file with numbers that can be plotted.
The raw data is little endian and each sample is 3 bytes (24bit). When I view the file with a text editor, I can strange, unreadable characters as shown below:
</DataInfo>
ò ì ê ì ð ô ù þ ý ù ø ÷ õ ò ï î î ï ò ô ò ï î î ï í é ç ë ò ø ú ü þ
Which makes sense because the data is still binary. So I used command line to produce a hexadecimal file that looks like:
000007F0 2F 46 50 3E 0A 3C 2F 44 61 74 61 49 6E 66 6F 3E /FP>.</DataInfo>
00000800 0D 0A 0D 0A F2 08 00 EC 08 00 EA 08 00 EC 08 00 ....ò..ì..ê..ì..
00000810 F0 08 00 F4 08 00 F9 08 00 FE 08 00 00 09 00 FD ð..ô..ù..þ.....ý
00000820 08 00 F9 08 00 F8 08 00 F7 08 00 F5 08 00 F2 08 ..ù..ø..÷..õ..ò.
My issue is when I convert the hexadecimal to a decimal number, the number is way too large to be correct and I can't figure out what went wrong?
I don't have a lot of knowledge on binary files but not even sure where to look, so any guidance is greatly appreciated!
FYI
I have some python programming skills and the file I am working with can be seen here: https://drive.google.com/file/d/1WZ6OBPLKIqrxw1GsvG776jqD8U08CMD9/view?usp=sharing

It looks like there is a header which ends with 0D 0A 0D 0A.
So the data starts at byte 0x804.
You can use this trick to parse it (NumPy: 3-byte, 6-byte types (aka uint24, uint48))
a = "F2 08 00 EC 08 00 EA 08 00 EC 08 00".replace(" ", "")
a = bytes(bytearray.fromhex(a))
a = np.frombuffer(a, dtype='<u1')
e = np.zeros(a.size // 3, np.dtype('<u4'))
for i in range(3):
e.view(dtype='<u1')[i::4] = a.view(dtype='<u1')[i::3]
print(e)
# Outputs [2290 2284 2282 2284]

Python: grouping items already in a list and reverse them

I have a binary file likes this:
00 01 02 04 03 03 03 03 00 05 06 03 03 03 03 03 00 07 03 03 03 03 03 03 ...
and I would like to make groups of 8 items each
[00 01 02 04 03 03 03 03] [00 05 06 03 03 03 03 03] [00 07 03 03 03 03 03 03]...
and then reverse the items inside each group like this:
[03 03 03 03 04 02 01 00] [03 03 03 03 03 06 05 00] [03 03 03 03 03 03 07 00]
I tried reverse() but it reverse all the list.
I've imagined something like that: in a loop I should count until 8 (or 7), make a group, reverse it, and then increment the row, count 8, reverse and so on but I am not able to code that.
I have tried
i=0
for item in (list_reverse):
i+=1
if i>8:
list_reverse.reverse()
i=0
but it doesn't work.
Maybe I should try a nested loop?

.split() the strings, then loop through it.
t = """00 01 02 04 03 03 03 03
00 05 06 03 03 03 03 03
00 07 03 03 03 03 03 03"""
out = ""
lines = t.split("\n")
for n, line in enumerate(lines):
lst = line.split(" ")
c = 0
reversed_lst = ""
while c < len(lst):
reversed_lst += (lst[len(lst)- c -1]) + " "; c+=1
if n != len(lines) - 1:
out += reversed_lst + "\n"
else:
out += reversed_lst
print(out)
Output:
03 03 03 03 04 02 01 00
03 03 03 03 03 06 05 00
03 03 03 03 03 03 07 00

This is a good usecase for Python's builtin bytearray. First, you can open the binary input file and use a bytearray to store its contents:
with open("binary.file", "rb") as f:
bytes = bytearray(f.read())
Then, we'll use your algorithm to loop over the bytes and store them in a variable called group (which is also a bytearray) every 8 iterations:
i = 0
groups = []
group = bytearray()
for byte in bytes:
i += 1
group.append(byte)
if i == 8:
groups.append(group)
i = 0
group = bytearray()
Just as quick sanity check, test the value of variable i because if it's not zero by now the final group would be less than 8 bytes long:
if i != 0:
raise EOFError("Input file does not align to 8 byte boundary!")
Finally, we'll reverse each group and print the output:
for group in groups:
group.reverse()
print(groups)
Depending on your usecase, you could also concatenate the reversed bytes and store them in another file or even overwrite the same file. Although my guess is if you would do that to any ordinary binary files like a JPEG or an EXE you will break them completely. Luckily, you could run the program again to restore them!

How to extract IPv6 field attribute values from the header present in hexadecimal string value in a file using Python?

An IPv6 header has the following value :
68 01 00 00 31 02 FF 2A 01 3F 4D 9C 7E 11 14 56 19 DE A0 BD CD 17 FF CD DF 01 03 04 BC 2B 3A 4E 9D AB DE 9D AE 07 FF (IN TXT FILE)
After removing white spaces:
680100003102FF2A013F4D9C7E11145619DEA0BDCD17FFCDDF010304BC2B3A4E9DABDE9DAE07FF
The above header is present in a file and I'm trying to extract all the field attributes present in the string as mentioned in the picture.
Here the picture contains header field attribute details:
I tried slicing and checking if the value falls below the upper limit(by using byte size) but it doesn't work when alphabets (in the hex format) come into the picture.
Is there any optimal and error-free way to do this generically in python?

Reading EEPROM addresses in Python and perform operations

I am currently trying to match pattern for an eeprom dump text file to locate a certain address and then traverse 4 steps once I hit upon in the search. I have tried the following code for finding the pattern
regexp_list = ('A1 B2')
line = open("dump.txt", 'r').read()
pattern = re.compile(regexp_list)
matches = re.findall(pattern,line)
for match in matches:
print(match)
this scans the dump for A1 B2 and displays if found. I need to add more such addresses in search criteria for ex: 'C1 B2', 'D1 F1'.
I tried making the regexp_list as a list and not a tuple, but it didn't work.
This is one of the problem. Next when I hit upon the search, I want to traverse 4 places and then read the address from there on (See below).
Input:
0120 86 1B 00 A1 B2 FF 15 A0 05 C2 D1 E4 00 25 04 00
Here when the search finds A1 B2 pattern, I want to move 4 places i.e to save data from C2 D1 E4 from the dump.
Expected Output:
C2 D1 E4
I hope the explanation was clear.
#
Thanks to #kcorlidy
Here's the final piece of code which I had to enter to delete the addresses in the first column.
newtxt = (text.split("A0 05")[1].split()[4:][:5])
for i in newtxt:
if len(i) > 2:
newtxt.remove(i)
and so the full code looks like
import re
text = open('dump.txt').read()
regex = r"(A1\s+B2)(\s+\w+){4}((\s+\w{2}(\s\w{4})?){3})"
for ele in re.findall(regex,text,re.MULTILINE):
print(" ".join([ok for ok in ele[2].split() if len(ok) == 2]))
print(text.split("A1 B2")[1].split()[4:][:5])
#selects the next 5 elements in the array including the address in 1st col
newtxt = (text.split("A1 B2")[1].split()[4:][:5])
for i in newtxt:
if len(i) > 2:
newtxt.remove(i)
Input:
0120 86 1B 00 00 C1 FF 15 00 00 A1 B2 00 00 00 00 C2
0130 D1 E4 00 00 FF 04 01 54 00 EB 00 54 89 B8 00 00
Output:
C2 0130 D1 E4 00
C2 D1 E4 00

Using regex can extract text, but also you can complete it through split text.
Regex:
(A1\s+B2) string start with A1 + one or more space + B2
(\s+\w+){4} move 4 places
((\s+\w+(\s+\w{4})?){3}) extract 3 group of string, and There may be 4 unneeded characters in the group. Then combine them into one.
Split:
Note: If you have a very long text or multiple lines, don't use this way.
text.split("A1 B2")[1] split text to two part. the after is we need
.split() split by blank space and became the list ['FF', '15', 'A0', '05', 'C2', 'D1', 'E4', '00', '25', '04', '00']
[4:][:3] move 4 places, and select the first three
Test code:
import re
text = """0120 86 1B 00 A1 B2 FF 15 A0 05 C2 D1 E4 00 25 04 00
0120 86 1B 00 00 C1 FF 15 00 00 A1 B2 00 00 00 00 C2
0130 D1 E4 00 00 FF 04 01 54 00 EB 00 54 89 B8 00 00 """
regex = r"(A1\s+B2)(\s+\w+){4}((\s+\w{2}(\s\w{4})?){3})"
for ele in re.findall(regex,text,re.MULTILINE):
#remove the string we do not need, such as blankspace, 0123, \n
print(" ".join([ok for ok in ele[2].split() if len(ok) == 2]))
print( text.split("A1 B2")[1].split()[4:][:3] )
Output
C2 D1 E4
C2 D1 E4
['C2', 'D1', 'E4']

printing number of snmpwalk results

Were trying to make a script on a Ubuntu server that reads the number of results from an snmpwalk command, and then sending it to Cacti for graphing.
Since none of us have any kind of programming knowledge and from what we have tried, we havent succeed.
It will go like this:
the script runs: snmpwalk -v 1 -c public -Cp 10.59.193.141 .1.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1
The command will print
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.34.250.121.174.124 = Hex-STRING: 00 22 FA 79 AE 7C
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.35.20.11.246.64 = Hex-STRING: 00 23 14 0B F6 40
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.38.198.89.34.192 = Hex-STRING: 00 26 C6 59 22 C0
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.40.224.44.221.222.148 = Hex-STRING: 28 E0 2C DD DE 94
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.100.163.203.10.120.83 = Hex-STRING: 64 A3 CB 0A 78 53
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.120.214.240.8.133.165 = Hex-STRING: 78 D6 F0 08 85 A5
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.132.0.210.179.213.93 = Hex-STRING: 84 00 D2 B3 D5 5D
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.132.0.210.201.8.196 = Hex-STRING: 84 00 D2 C9 08 C4
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.140.112.90.108.236.188 = Hex-STRING: 8C 70 5A 6C EC BC
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.140.112.90.139.18.244 = Hex-STRING: 8C 70 5A 8B 12 F4
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.180.240.171.112.37.69 = Hex-STRING: B4 F0 AB 70 25 45
Variables found: 11
Then the script should somehow do: read until Variables found: and read "11", and then print "11".
So basically we want the script to filter out the number "11" in this case which we can use in Cacti for graphing. We've tried some scripts on google and looked around for information, but found nothing.
I think it should be easy if you know how to do it, but we are beginners at programming.
Thanks in advance!

Using perl, add following command after a pipe to extract the number you want:
... | perl -ne 'm/\A(?i)variables\s+/ and m/(\d+)\s*$/ and printf qq|%s\n|, $1 and exit'
It will print:
11

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I match multiline expressions with junk in the middle? - python

Related

How do I convert binary file data into numbers?

Python: grouping items already in a list and reverse them

How to extract IPv6 field attribute values from the header present in hexadecimal string value in a file using Python?

Reading EEPROM addresses in Python and perform operations

printing number of snmpwalk results

Categories

Resources