Negative look ahead python regex

Negative look ahead python regex - python

I would like to regex match a sequence of bytes when the string '02 d0' does not occur at a specific position in the string. The position where this string of two bytes cannot occur are byte positions 6 and 7 starting with the 0th byte on the right hand side.
This is what I have been using for testing:
#!/usr/bin/python
import re
p0 = re.compile('^24 [\da-f]{2} 03 (01|03) [\da-f]{2} [\da-f]{2} [\da-f]{2} (([^0])| (0[^2])|(02 [^d])|(02 d[^0])) 01 c2 [\da-f]{2} [\da-f]{2} [\da-f]{2} 23')
p1 = re.compile('^24 [\da-f]{2} 03 (01|03) [\da-f]{2} [\da-f]{2} [\da-f]{2} (([^0])|(0[^2])|(02 [^d])|(02 d[^0])) 01')
p2 = re.compile('^24 [\da-f]{2} 03 (01|03) [\da-f]{2} [\da-f]{2} [\da-f]{2} (([^0])|(0[^2])|(02 [^d])|(02 d[^0]))')
p3 = re.compile('^24 [\da-f]{2} 03 (01|03) [\da-f]{2} [\da-f]{2} [\da-f]{2} (?!02 d0) 01')
p4 = re.compile('^24 [\da-f]{2} 03 (01|03) [\da-f]{2} [\da-f]{2} [\da-f]{2} (?!02 d0)')
yes = '24 0f 03 01 42 ff 00 04 a2 01 c2 00 c5 e5 23'
no = '24 0f 03 01 42 ff 00 02 d0 01 c2 00 c5 e5 23'
print p0.match(yes) # fail
print p0.match(no) # fail
print '\n'
print p1.match(yes) # fail
print p1.match(no) # fail
print '\n'
print p2.match(yes) # PASS
print p2.match(no) # fail
print '\n'
print p3.match(yes) # fail
print p3.match(no) # fail
print '\n'
print p4.match(yes) # PASS
print p4.match(no) # fail
I looked at this example, but that method is less restrictive than I need. Could someone explain why I can only match properly when the negative look ahead is at the end of the string? What do I need to do to match when '02 d0' does not occur in this specific bit position?

Lookaheads are "zero-width", meaning they do not consume any characters. For example, these two expressions will never match:
(?=foo)bar
(?!foo)foo
To make sure a number is not some specific number, you could use:
(?!42)\d\d # will match two digits that are not 42
In your case it could look like:
(?!02)[\da-f]{2} (?!0d)[\da-f]{2}
or:
(?!02 d0)[\da-f]{2} [\da-f]{2}

Related

How do I match multiline expressions with junk in the middle?

I'm trying to match a multiline expression from some logs we have. The biggest problem is due to race-conditions, we sometimes have to use a custom print function with a mutex, and sometimes (when that's not necessary) we just use printf. This results in two types of logs.
My solution was this monstrosity:
changed key '(\w+)' value: <((([0-9a-f]{2} *)+)(?:\n)*(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)*)*>
Explanation of the above regex:
changed key '(\w+)' value: - This is how we detect a print (and save the keyname in a capture group).
<{regex}> - The value output starts with < and ends with >
([0-9a-f]{2} *) - The bytes are hexadecimal pairs followed by an optional space (because last byte doesn't have a space). Let's call this capture group 4.
({group4}+) - One or more of group 4.
(?:\n)* - There can be 0 or more newlines after this "XX " pair. (non-capture)
(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)* - There can be 0 or more prints of the timestamp. (non-capture)
This works for the Case 2 logs, but not for the Case 1 logs. In Case 1, for some reason only the last line is matched.
Essentially, I'm trying to match this (two capture groups):
changed key '(\w+)' value: <({only hexadecimal pairs})>
group 1: key
group 2: value
Below is the dummy cases (same value in all cases):
// Case 1
<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>
// Case 2
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>
// Case 2 with some newlines in the middle
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00
04 00 00 ff
ff 00 00 00 11 00
00 00 00 00 21>
The key isn't always the same key, so the value (and the value length) can change.

This approach starts by first stripping out the leading log content of each line, leaving behind the content you want to target. After that, it does an re.findall search using a regex pattern similar to the one you are already using.
inp = """<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>"""
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], re.sub(r'\s+', ' ', x[1])) for x in matches]
print(matches)
This prints:
[('KEY_NAME', 'ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21')]
Assuming there could be unwanted values in between 'KEY_NAME' value: < and the closing >, we can use re.findall on the second group to match all hexadecimal values:
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], ' '.join(re.findall(r'\b[a-f0-9]{2}\b', x[1]))) for x in matches]
print(matches) # output same as above

Python: grouping items already in a list and reverse them

I have a binary file likes this:
00 01 02 04 03 03 03 03 00 05 06 03 03 03 03 03 00 07 03 03 03 03 03 03 ...
and I would like to make groups of 8 items each
[00 01 02 04 03 03 03 03] [00 05 06 03 03 03 03 03] [00 07 03 03 03 03 03 03]...
and then reverse the items inside each group like this:
[03 03 03 03 04 02 01 00] [03 03 03 03 03 06 05 00] [03 03 03 03 03 03 07 00]
I tried reverse() but it reverse all the list.
I've imagined something like that: in a loop I should count until 8 (or 7), make a group, reverse it, and then increment the row, count 8, reverse and so on but I am not able to code that.
I have tried
i=0
for item in (list_reverse):
i+=1
if i>8:
list_reverse.reverse()
i=0
but it doesn't work.
Maybe I should try a nested loop?

.split() the strings, then loop through it.
t = """00 01 02 04 03 03 03 03
00 05 06 03 03 03 03 03
00 07 03 03 03 03 03 03"""
out = ""
lines = t.split("\n")
for n, line in enumerate(lines):
lst = line.split(" ")
c = 0
reversed_lst = ""
while c < len(lst):
reversed_lst += (lst[len(lst)- c -1]) + " "; c+=1
if n != len(lines) - 1:
out += reversed_lst + "\n"
else:
out += reversed_lst
print(out)
Output:
03 03 03 03 04 02 01 00
03 03 03 03 03 06 05 00
03 03 03 03 03 03 07 00

This is a good usecase for Python's builtin bytearray. First, you can open the binary input file and use a bytearray to store its contents:
with open("binary.file", "rb") as f:
bytes = bytearray(f.read())
Then, we'll use your algorithm to loop over the bytes and store them in a variable called group (which is also a bytearray) every 8 iterations:
i = 0
groups = []
group = bytearray()
for byte in bytes:
i += 1
group.append(byte)
if i == 8:
groups.append(group)
i = 0
group = bytearray()
Just as quick sanity check, test the value of variable i because if it's not zero by now the final group would be less than 8 bytes long:
if i != 0:
raise EOFError("Input file does not align to 8 byte boundary!")
Finally, we'll reverse each group and print the output:
for group in groups:
group.reverse()
print(groups)
Depending on your usecase, you could also concatenate the reversed bytes and store them in another file or even overwrite the same file. Although my guess is if you would do that to any ordinary binary files like a JPEG or an EXE you will break them completely. Luckily, you could run the program again to restore them!

How to extract IPv6 field attribute values from the header present in hexadecimal string value in a file using Python?

An IPv6 header has the following value :
68 01 00 00 31 02 FF 2A 01 3F 4D 9C 7E 11 14 56 19 DE A0 BD CD 17 FF CD DF 01 03 04 BC 2B 3A 4E 9D AB DE 9D AE 07 FF (IN TXT FILE)
After removing white spaces:
680100003102FF2A013F4D9C7E11145619DEA0BDCD17FFCDDF010304BC2B3A4E9DABDE9DAE07FF
The above header is present in a file and I'm trying to extract all the field attributes present in the string as mentioned in the picture.
Here the picture contains header field attribute details:
I tried slicing and checking if the value falls below the upper limit(by using byte size) but it doesn't work when alphabets (in the hex format) come into the picture.
Is there any optimal and error-free way to do this generically in python?

Reading EEPROM addresses in Python and perform operations

I am currently trying to match pattern for an eeprom dump text file to locate a certain address and then traverse 4 steps once I hit upon in the search. I have tried the following code for finding the pattern
regexp_list = ('A1 B2')
line = open("dump.txt", 'r').read()
pattern = re.compile(regexp_list)
matches = re.findall(pattern,line)
for match in matches:
print(match)
this scans the dump for A1 B2 and displays if found. I need to add more such addresses in search criteria for ex: 'C1 B2', 'D1 F1'.
I tried making the regexp_list as a list and not a tuple, but it didn't work.
This is one of the problem. Next when I hit upon the search, I want to traverse 4 places and then read the address from there on (See below).
Input:
0120 86 1B 00 A1 B2 FF 15 A0 05 C2 D1 E4 00 25 04 00
Here when the search finds A1 B2 pattern, I want to move 4 places i.e to save data from C2 D1 E4 from the dump.
Expected Output:
C2 D1 E4
I hope the explanation was clear.
#
Thanks to #kcorlidy
Here's the final piece of code which I had to enter to delete the addresses in the first column.
newtxt = (text.split("A0 05")[1].split()[4:][:5])
for i in newtxt:
if len(i) > 2:
newtxt.remove(i)
and so the full code looks like
import re
text = open('dump.txt').read()
regex = r"(A1\s+B2)(\s+\w+){4}((\s+\w{2}(\s\w{4})?){3})"
for ele in re.findall(regex,text,re.MULTILINE):
print(" ".join([ok for ok in ele[2].split() if len(ok) == 2]))
print(text.split("A1 B2")[1].split()[4:][:5])
#selects the next 5 elements in the array including the address in 1st col
newtxt = (text.split("A1 B2")[1].split()[4:][:5])
for i in newtxt:
if len(i) > 2:
newtxt.remove(i)
Input:
0120 86 1B 00 00 C1 FF 15 00 00 A1 B2 00 00 00 00 C2
0130 D1 E4 00 00 FF 04 01 54 00 EB 00 54 89 B8 00 00
Output:
C2 0130 D1 E4 00
C2 D1 E4 00

Using regex can extract text, but also you can complete it through split text.
Regex:
(A1\s+B2) string start with A1 + one or more space + B2
(\s+\w+){4} move 4 places
((\s+\w+(\s+\w{4})?){3}) extract 3 group of string, and There may be 4 unneeded characters in the group. Then combine them into one.
Split:
Note: If you have a very long text or multiple lines, don't use this way.
text.split("A1 B2")[1] split text to two part. the after is we need
.split() split by blank space and became the list ['FF', '15', 'A0', '05', 'C2', 'D1', 'E4', '00', '25', '04', '00']
[4:][:3] move 4 places, and select the first three
Test code:
import re
text = """0120 86 1B 00 A1 B2 FF 15 A0 05 C2 D1 E4 00 25 04 00
0120 86 1B 00 00 C1 FF 15 00 00 A1 B2 00 00 00 00 C2
0130 D1 E4 00 00 FF 04 01 54 00 EB 00 54 89 B8 00 00 """
regex = r"(A1\s+B2)(\s+\w+){4}((\s+\w{2}(\s\w{4})?){3})"
for ele in re.findall(regex,text,re.MULTILINE):
#remove the string we do not need, such as blankspace, 0123, \n
print(" ".join([ok for ok in ele[2].split() if len(ok) == 2]))
print( text.split("A1 B2")[1].split()[4:][:3] )
Output
C2 D1 E4
C2 D1 E4
['C2', 'D1', 'E4']

printing number of snmpwalk results

Were trying to make a script on a Ubuntu server that reads the number of results from an snmpwalk command, and then sending it to Cacti for graphing.
Since none of us have any kind of programming knowledge and from what we have tried, we havent succeed.
It will go like this:
the script runs: snmpwalk -v 1 -c public -Cp 10.59.193.141 .1.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1
The command will print
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.34.250.121.174.124 = Hex-STRING: 00 22 FA 79 AE 7C
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.35.20.11.246.64 = Hex-STRING: 00 23 14 0B F6 40
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.38.198.89.34.192 = Hex-STRING: 00 26 C6 59 22 C0
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.40.224.44.221.222.148 = Hex-STRING: 28 E0 2C DD DE 94
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.100.163.203.10.120.83 = Hex-STRING: 64 A3 CB 0A 78 53
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.120.214.240.8.133.165 = Hex-STRING: 78 D6 F0 08 85 A5
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.132.0.210.179.213.93 = Hex-STRING: 84 00 D2 B3 D5 5D
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.132.0.210.201.8.196 = Hex-STRING: 84 00 D2 C9 08 C4
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.140.112.90.108.236.188 = Hex-STRING: 8C 70 5A 6C EC BC
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.140.112.90.139.18.244 = Hex-STRING: 8C 70 5A 8B 12 F4
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.180.240.171.112.37.69 = Hex-STRING: B4 F0 AB 70 25 45
Variables found: 11
Then the script should somehow do: read until Variables found: and read "11", and then print "11".
So basically we want the script to filter out the number "11" in this case which we can use in Cacti for graphing. We've tried some scripts on google and looked around for information, but found nothing.
I think it should be easy if you know how to do it, but we are beginners at programming.
Thanks in advance!

Using perl, add following command after a pipe to extract the number you want:
... | perl -ne 'm/\A(?i)variables\s+/ and m/(\d+)\s*$/ and printf qq|%s\n|, $1 and exit'
It will print:
11

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Negative look ahead python regex - python

Related

How do I match multiline expressions with junk in the middle?

Python: grouping items already in a list and reverse them

How to extract IPv6 field attribute values from the header present in hexadecimal string value in a file using Python?

Reading EEPROM addresses in Python and perform operations

printing number of snmpwalk results

Categories

Resources