I am currently trying to match a pattern in an EEPROM dump text file to locate a certain address, and then traverse 4 steps forward once the search hits. I have tried the following code for finding the pattern:
import re

regexp_list = 'A1 B2'  # note: parentheses around a single string do not create a tuple
line = open("dump.txt", 'r').read()
pattern = re.compile(regexp_list)
matches = re.findall(pattern, line)
for match in matches:
    print(match)
This scans the dump for A1 B2 and displays it if found. I need to add more such addresses to the search criteria, for example 'C1 B2' and 'D1 F1'.
I tried making the regexp_list as a list and not a tuple, but it didn't work.
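For matching several byte pairs at once, one option (a sketch, not tested against your actual dump) is to join the alternatives into a single alternation pattern rather than a list or tuple:

```python
import re

# Hypothetical list of target byte pairs; re.escape keeps each one literal
addresses = ['A1 B2', 'C1 B2', 'D1 F1']
pattern = re.compile('|'.join(map(re.escape, addresses)))

text = "0120 86 1B 00 A1 B2 FF 15 C1 B2 D1 F1"
print(pattern.findall(text))  # ['A1 B2', 'C1 B2', 'D1 F1']
```

A list fails because re.compile expects a single pattern string, not a sequence; the alternation collapses the list into one such string.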
That is one of the problems. Next, when the search hits, I want to traverse 4 places and then read the data from there on (see below).
Input:
0120 86 1B 00 A1 B2 FF 15 A0 05 C2 D1 E4 00 25 04 00
Here, when the search finds the A1 B2 pattern, I want to move 4 places, i.e. save the data C2 D1 E4 from the dump.
Expected Output:
C2 D1 E4
I hope the explanation was clear.
Thanks to @kcorlidy.
Here's the final piece of code which I had to add to delete the addresses in the first column:
newtxt = text.split("A0 05")[1].split()[4:][:5]
# keep only the two-character byte values; drop the 4-character row addresses
newtxt = [i for i in newtxt if len(i) <= 2]
and so the full code looks like:
import re

text = open('dump.txt').read()
regex = r"(A1\s+B2)(\s+\w+){4}((\s+\w{2}(\s\w{4})?){3})"
for ele in re.findall(regex, text, re.MULTILINE):
    print(" ".join(ok for ok in ele[2].split() if len(ok) == 2))

# selects the next 5 elements in the list, including the address in the 1st column
print(text.split("A1 B2")[1].split()[4:][:5])
newtxt = text.split("A1 B2")[1].split()[4:][:5]
# keep only the two-character byte values; drop the 4-character row addresses
newtxt = [i for i in newtxt if len(i) <= 2]
Input:
0120 86 1B 00 00 C1 FF 15 00 00 A1 B2 00 00 00 00 C2
0130 D1 E4 00 00 FF 04 01 54 00 EB 00 54 89 B8 00 00
Output:
C2 0130 D1 E4 00
C2 D1 E4 00
Regex can extract the text, but you can also accomplish this by splitting the string.
Regex:
(A1\s+B2) matches A1, one or more whitespace characters, then B2
(\s+\w+){4} moves 4 places forward
((\s+\w{2}(\s\w{4})?){3}) extracts 3 two-character groups; each may be followed by a 4-character row address, which is filtered out afterwards
Split:
Note: if you have very long text or multiple lines, don't use this approach.
text.split("A1 B2")[1] splits the text into two parts; we need the part after the match
.split() splits on whitespace, giving the list ['FF', '15', 'A0', '05', 'C2', 'D1', 'E4', '00', '25', '04', '00']
[4:][:3] moves 4 places forward, then selects the first three elements
Test code:
import re

text = """0120 86 1B 00 A1 B2 FF 15 A0 05 C2 D1 E4 00 25 04 00
0120 86 1B 00 00 C1 FF 15 00 00 A1 B2 00 00 00 00 C2
0130 D1 E4 00 00 FF 04 01 54 00 EB 00 54 89 B8 00 00 """

regex = r"(A1\s+B2)(\s+\w+){4}((\s+\w{2}(\s\w{4})?){3})"
for ele in re.findall(regex, text, re.MULTILINE):
    # drop the tokens we do not need, such as whitespace and 4-digit addresses
    print(" ".join(ok for ok in ele[2].split() if len(ok) == 2))

print(text.split("A1 B2")[1].split()[4:][:3])
Output
C2 D1 E4
C2 D1 E4
['C2', 'D1', 'E4']
I'm trying to match a multiline expression from some logs we have. The biggest problem is that, due to race conditions, we sometimes have to use a custom print function with a mutex, and at other times (when that's not necessary) we just use printf. This results in two types of logs.
My solution was this monstrosity:
changed key '(\w+)' value: <((([0-9a-f]{2} *)+)(?:\n)*(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)*)*>
Explanation of the above regex:
changed key '(\w+)' value: - This is how we detect a print (and save the keyname in a capture group).
<{regex}> - The value output starts with < and ends with >
([0-9a-f]{2} *) - The bytes are hexadecimal pairs followed by an optional space (because last byte doesn't have a space). Let's call this capture group 4.
({group4}+) - One or more of group 4.
(?:\n)* - There can be 0 or more newlines after this "XX " pair. (non-capture)
(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)* - There can be 0 or more prints of the timestamp. (non-capture)
This works for the Case 2 logs, but not for the Case 1 logs. In Case 1, for some reason only the last line is matched.
Essentially, I'm trying to match this (two capture groups):
changed key '(\w+)' value: <({only hexadecimal pairs})>
group 1: key
group 2: value
Below are the dummy cases (same value in all cases):
// Case 1
<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>
// Case 2
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>
// Case 2 with some newlines in the middle
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00
04 00 00 ff
ff 00 00 00 11 00
00 00 00 00 21>
The key isn't always the same key, so the value (and the value length) can change.
This approach first strips the leading log prefix from each line, leaving behind the content you want to target. It then runs re.findall using a regex pattern similar to the one you are already using.
import re

inp = """<22213> Nov 30 00:00:00.287 [D1] [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]
<22213> Nov 30 00:00:00.287 [D1] [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1] [128]ff ff
<22213> Nov 30 00:00:00.287 [D1] [128]00 00 00 11 00 00 00 00 00 21>"""
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], re.sub(r'\s+', ' ', x[1])) for x in matches]
print(matches)
This prints:
[('KEY_NAME', 'ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21')]
Assuming there could be unwanted values between 'KEY_NAME' value: < and the closing >, we can run re.findall on the second group to pull out all the hexadecimal values:
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], ' '.join(re.findall(r'\b[a-f0-9]{2}\b', x[1]))) for x in matches]
print(matches) # output same as above
I am trying to decode a datetime from its byte form. I tried various methods (seconds, minutes, or hours from 1-1-1970; minutes from 1-1-1, etc.). I also tried the MySQL encoding (https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html), with no effect either.
Please help me figure out how the datetime is stored.
bf aa b8 c3 e5 2f
d7 be ba c3 e5 2f
80 a0 c0 c3 e5 2f
a7 fc bf c3 e5 2f
ae fd f2 c3 e5 2f
9e dd fa c3 e5 2f
c7 ce fa c3 e5 2f
b9 f5 82 c4 e5 2f
f8 95 f2 c3 e5 2f
Everything is around 01/14/2022 12:00
Each value is a timestamp encoded as a VARINT. A long time ago I wrote the following functions to decode/encode VARINTs:
def to_varint(src):
    buf = b""
    while True:
        towrite = src & 0x7f
        src >>= 7
        if src:
            buf += bytes((towrite | 0x80,))
        else:
            buf += bytes((towrite,))
            break
    return buf
def from_varint(src):
    shift = 0
    result = 0
    for i in src:
        result |= (i & 0x7f) << shift
        shift += 7
    return result
So using these functions we can decode your values:
from datetime import datetime
...
values = """\
bf aa b8 c3 e5 2f
d7 be ba c3 e5 2f
80 a0 c0 c3 e5 2f
a7 fc bf c3 e5 2f
ae fd f2 c3 e5 2f
9e dd fa c3 e5 2f
c7 ce fa c3 e5 2f
b9 f5 82 c4 e5 2f
f8 95 f2 c3 e5 2f"""
for value in values.splitlines():
    timestamp = from_varint(bytes.fromhex(value))
    dt = datetime.fromtimestamp(timestamp / 1000)
    print(dt)
To get the current timestamp in this encoding you can use the following code:
current = to_varint(int(datetime.now().timestamp() * 1000)).hex(" ")
An IPv6 header has the following value (in a text file):
68 01 00 00 31 02 FF 2A 01 3F 4D 9C 7E 11 14 56 19 DE A0 BD CD 17 FF CD DF 01 03 04 BC 2B 3A 4E 9D AB DE 9D AE 07 FF
After removing white spaces:
680100003102FF2A013F4D9C7E11145619DEA0BDCD17FFCDDF010304BC2B3A4E9DABDE9DAE07FF
The above header is present in a file, and I'm trying to extract all the field attributes from the string, as shown in the picture.
The picture contains the header field attribute details:
I tried slicing and checking whether the value falls below the upper limit (by byte size), but it doesn't work when letters (the hex digits A-F) come into the picture.
Is there an optimal, error-free way to do this generically in Python?
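I can't see the picture, but assuming the standard IPv6 fixed-header layout (RFC 8200), a sketch like this avoids string comparison entirely: convert the hex text to bytes first, then extract the fields with shifts and masks, so the hex letters A-F are handled automatically.

```python
# Assumed sample from the question (note: it is one byte short of a full 40-byte header)
hex_str = ("68 01 00 00 31 02 FF 2A 01 3F 4D 9C 7E 11 14 56 19 DE A0 BD "
           "CD 17 FF CD DF 01 03 04 BC 2B 3A 4E 9D AB DE 9D AE 07 FF")
data = bytes.fromhex(hex_str)  # fromhex() ignores the whitespace

version        = data[0] >> 4                                  # high nibble of byte 0
traffic_class  = ((data[0] & 0x0F) << 4) | (data[1] >> 4)      # next 8 bits
flow_label     = ((data[1] & 0x0F) << 16) | (data[2] << 8) | data[3]
payload_length = int.from_bytes(data[4:6], "big")
next_header    = data[6]
hop_limit      = data[7]
src_addr       = data[8:24].hex()    # 16-byte source address
dst_addr       = data[24:40].hex()   # destination (truncated here, sample is short)

print(version, traffic_class, flow_label, payload_length, next_header, hop_limit)
```

Working on integers this way also makes range checks trivial, e.g. `version == 6` or `hop_limit <= 255`.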
We're trying to make a script on an Ubuntu server that reads the number of results from an snmpwalk command and then sends it to Cacti for graphing.
Since none of us has any programming knowledge, nothing we have tried has succeeded.
It will go like this:
the script runs: snmpwalk -v 1 -c public -Cp 10.59.193.141 .1.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1
The command will print
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.34.250.121.174.124 = Hex-STRING: 00 22 FA 79 AE 7C
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.35.20.11.246.64 = Hex-STRING: 00 23 14 0B F6 40
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.0.38.198.89.34.192 = Hex-STRING: 00 26 C6 59 22 C0
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.40.224.44.221.222.148 = Hex-STRING: 28 E0 2C DD DE 94
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.100.163.203.10.120.83 = Hex-STRING: 64 A3 CB 0A 78 53
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.120.214.240.8.133.165 = Hex-STRING: 78 D6 F0 08 85 A5
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.132.0.210.179.213.93 = Hex-STRING: 84 00 D2 B3 D5 5D
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.132.0.210.201.8.196 = Hex-STRING: 84 00 D2 C9 08 C4
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.140.112.90.108.236.188 = Hex-STRING: 8C 70 5A 6C EC BC
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.140.112.90.139.18.244 = Hex-STRING: 8C 70 5A 8B 12 F4
iso.3.6.1.4.1.11.2.14.11.6.4.1.1.8.1.1.2.1.180.240.171.112.37.69 = Hex-STRING: B4 F0 AB 70 25 45
Variables found: 11
Then the script should read up to "Variables found:", extract the "11", and print "11".
So basically we want the script to filter out the number "11" in this case, which we can then use in Cacti for graphing. We've tried some scripts found on Google and looked around for information, but found nothing.
I think it should be easy if you know how to do it, but we are beginners at programming.
Thanks in advance!
Using Perl, add the following command after a pipe to extract the number you want:
... | perl -ne 'm/\A(?i)variables\s+/ and m/(\d+)\s*$/ and printf qq|%s\n|, $1 and exit'
It will print:
11
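If Perl isn't available, an awk one-liner (my sketch, not from the original answer) does the same thing; it prints the third whitespace-separated field of the "Variables found:" line. Shown here against a sample line for illustration:

```shell
# Stand-in for the real snmpwalk output
snmpwalk_output='Variables found: 11'

printf '%s\n' "$snmpwalk_output" | awk '/^Variables found:/ {print $3}'
```

In your script, replace the sample with the real pipeline, e.g. `snmpwalk ... | awk '/^Variables found:/ {print $3}'`.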
I'm using Mako templates to generate specialized config files. Some of these files contain extended ASCII chars (>127), but Mako chokes, saying the chars are out of range, when I use:
## -*- coding: ascii -*-
So I'm wondering if perhaps there's something like:
## -*- coding: eascii -*-
that I can use which will accept the range(128, 256) chars.
EDIT:
Here's the dump of the offending section of the file:
000001b0 39 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce |9...............|
000001c0 cf d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de |................|
000001d0 df e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee |................|
000001e0 ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe |................|
000001f0 ff 5d 2b 28 27 73 29 3f 22 0a 20 20 20 20 20 20 |.]+('s)?". |
00000200 20 20 74 6f 6b 65 6e 3a 20 57 4f 52 44 20 20 20 | token: WORD |
00000210 20 20 22 5b 41 2d 5a 61 2d 7a 30 2d 39 c0 c1 c2 | "[A-Za-z0-9...|
00000220 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf d0 d1 d2 |................|
00000230 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df e0 e1 e2 |................|
00000240 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef f0 f1 f2 |................|
00000250 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff 5d 2b 28 |.............]+(|
The first character that Mako complains about is at offset 000001b4. If I remove this section, everything works fine. With the section included, Mako complains:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
It's the same complaint whether I use 'ascii' or 'latin-1' in the magic comment line.
Thanks!
Greg
Short answer
Use cp437 as the encoding for some retro DOS fun. All byte values greater than or equal to 32 decimal, except 127, are mapped to displayable characters in this encoding. Then use cp037 as the encoding for a truly trippy time. And then ask yourself how you really know which of these, if either of them, is "correct".
Long answer
There is something you must unlearn: the absolute equivalence of byte values and characters.
Many basic text editors and debugging tools today, and also the Python language specification, imply an absolute equivalence between bytes and characters when in reality none exists. It is not true that 74 6f 6b 65 6e is "token". Only for ASCII-compatible character encodings is this correspondence valid. In EBCDIC, which is still quite common today, "token" corresponds to byte values a3 96 92 85 95.
So while the Python 2.6 interpreter happily evaluates 'text' == u'text' as True, it shouldn't, because they are only equivalent under the assumption of ASCII or a compatible encoding, and even then they should not be considered equal. (At least '\xfd' == u'\xfd' is False and gets you a warning for trying.) Python 3.1 evaluates 'text' == b'text' as False. But even the acceptance of this expression by the interpreter implies an absolute equivalence of byte values and characters, because the expression b'text' is taken to mean "the byte-string you get when you apply the ASCII encoding to 'text'" by the interpreter.
As far as I know, every programming language in widespread use today carries an implicit use of ASCII or ISO-8859-1 (Latin-1) character encoding somewhere in its design. In C, the char data type is really a byte. I saw one Java 1.4 VM where the constructor java.lang.String(byte[] data) assumed ISO-8859-1 encoding. Most compilers and interpreters assume ASCII or ISO-8859-1 encoding of source code (some let you change it). In Java, string length is really the UTF-16 code unit length, which is arguably wrong for characters U+10000 and above. In Unix, filenames are byte-strings interpreted according to terminal settings, allowing you to open('a\x08b', 'w').write('Say my name!').
So we have all been trained and conditioned by the tools we have learned to trust, to believe that 'A' is 0x41. But it isn't. 'A' is a character and 0x41 is a byte and they are simply not equal.
Once you have become enlightened on this point, you will have no trouble resolving your issue. You have simply to decide what component in the software is assuming the ASCII encoding for these byte values, and how to either change that behavior or ensure that different byte values appear instead.
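To see the point concretely, here is a small sketch (my addition, not from the original answer) decoding the same three byte values under two different encodings:

```python
data = bytes([0xC1, 0xC2, 0xC3])  # three raw byte values

print(data.decode('latin-1'))  # accented letters under ISO-8859-1
print(data.decode('cp037'))    # the very same bytes are plain 'ABC' in EBCDIC
```

Neither output is more "correct" than the other; the bytes only become characters once you pick an encoding.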
PS: The phrases "extended ASCII" and "ANSI character set" are misnomers.
Try
## -*- coding: UTF-8 -*-
or
## -*- coding: latin-1 -*-
or
## -*- coding: cp1252 -*-
depending on what you really need. The last two are similar except:
The Windows-1252 codepage coincides with ISO-8859-1 for all codes except the range 128 to 159 (hex 80 to 9F), where the little-used C1 controls are replaced with additional characters. Windows-28591 is the actual ISO-8859-1 codepage.
where ISO-8859-1 is the official name for latin-1.
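A short sketch of that difference (my illustration): bytes in the 0x80-0x9F range decode to printable characters in cp1252 but to invisible C1 control characters in latin-1.

```python
data = b'\x93hello\x94'  # cp1252 "smart quote" bytes around a word

print(data.decode('cp1252'))          # curly quotes: U+201C hello U+201D
print(repr(data.decode('latin-1')))   # same bytes stay as C1 control codes
```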
Try examining your data with a critical eye:
000001b0 39 c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce |9...............|
000001c0 cf d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de |................|
000001d0 df e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee |................|
000001e0 ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe |................|
000001f0 ff 5d 2b 28 27 73 29 3f 22 0a 20 20 20 20 20 20 |.]+('s)?". |
00000200 20 20 74 6f 6b 65 6e 3a 20 57 4f 52 44 20 20 20 | token: WORD |
00000210 20 20 22 5b 41 2d 5a 61 2d 7a 30 2d 39 c0 c1 c2 | "[A-Za-z0-9...|
00000220 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf d0 d1 d2 |................|
00000230 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df e0 e1 e2 |................|
00000240 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef f0 f1 f2 |................|
00000250 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff 5d 2b 28 |.............]+(|
The highlighted bytes are two runs of every byte from 0xc0 to 0xff inclusive. You appear to have a binary file (perhaps a dump of compiled regexes), not a text file. I suggest that you read it as a binary file rather than pasting it into your Python source file. You should also read the Mako docs to find out what encoding it expects.
Update after eyeballing the text part of your dump: you may well be able to express this with ASCII-only regexes, e.g. you would have a line containing
token: WORD "[A-Za-z0-9\xc0-\xff]+(etc)etc"
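A sketch of that idea (my addition, with a made-up sample string): work entirely at the byte level with a bytes pattern, so nothing has to be decoded at all.

```python
import re

# Bytes pattern: ASCII alphanumerics plus any high byte in 0xC0-0xFF
pattern = re.compile(rb'[A-Za-z0-9\xc0-\xff]+')

data = b'foo \xc3\xdebar baz'
print(pattern.findall(data))  # [b'foo', b'\xc3\xdebar', b'baz']
```

Because both pattern and input are bytes, no codec ever runs, so there is nothing for a UnicodeDecodeError to trip over.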