I have a simple script for combing through IP addresses. I'd like to extract the IP with a regex from the following output:
Starting Nmap 7.91 ( https://nmap.org ) at 2020-12-11 15:04 EST
Nmap scan report for host.com (127.0.0.1)
Host is up (0.14s latency).
I tried using this tool: https://pythex.org/. I was able to get a match with the following pattern
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
However, this code finds no match:
regex = re.match(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", output)
print(regex)
The expected output is 127.0.0.1. Any help with this would be greatly appreciated.
re.match matches a pattern at the beginning of the given string. It looks like what you want is re.findall or re.search:
import re
output = '''
Starting Nmap 7.91 ( https://nmap.org ) at 2020-12-11 15:04 EST
Nmap scan report for host.com (127.0.0.1)
Host is up (0.14s latency).'''
regex = re.findall(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", output)
print(regex) # ['127.0.0.1']
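To make the difference concrete, here is a small sketch that runs re.match, re.search and re.findall against the Nmap output from the question. The names IP_PATTERN, at_start, first and every are illustrative only.

```python
import re

output = """Starting Nmap 7.91 ( https://nmap.org ) at 2020-12-11 15:04 EST
Nmap scan report for host.com (127.0.0.1)
Host is up (0.14s latency)."""

IP_PATTERN = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# re.match only tries position 0, where the text is "Starting", so it fails.
at_start = re.match(IP_PATTERN, output)

# re.search scans forward and stops at the first match.
first = re.search(IP_PATTERN, output)

# re.findall collects every non-overlapping match.
every = re.findall(IP_PATTERN, output)

print(at_start)       # None
print(first.group())  # 127.0.0.1
print(every)          # ['127.0.0.1']
```

re.search is probably the better fit when only one address is expected, and re.findall when the output may contain several hosts.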
You should use re.search
>>> import re
>>> text = 'Nmap scan report for host.com (127.0.0.1)'
>>> re.search(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", text).group()
'127.0.0.1'
With re.match, the pattern must also consume everything before the IP, because re.match only matches at the start of the string:
output = 'Your output' # the Nmap output shown above
ip = re.match(r'.*?\(([\d.]+)\)', output, re.DOTALL).group(1)
Explanation: .*? lazily skips ahead (re.DOTALL lets it cross line breaks) to the first pair of parentheses that contains only digits and dots. group(1) then returns the text captured inside those parentheses.
Related
I'm trying to extract pertinent information from a large textfile (1000+ lines), most of which isn't important:
ID: 67108866 Virtual-system: root, VPN Name: VPN-NAME-XYZ
Local Gateway: 1.1.1.1, Remote Gateway: 2.2.2.2
Traffic Selector Name: TS-1
Local Identity: ipv4(10.10.10.0-10.10.10.255)
Remote Identity: ipv4(10.20.10.0-10.20.10.255)
Version: IKEv2
DF-bit: clear, Copy-Outer-DSCP Disabled, Bind-interface: st0.287
Port: 500, Nego#: 0, Fail#: 0, Def-Del#: 0 Flag: 0x2c608b29
Multi-sa, Configured SAs# 1, Negotiated SAs#: 1
Tunnel events:
From this I need to extract only certain bits; an example output would be something like:
VPN Name: VPN-NAME-XYZ, Local Gateway: 1.1.1.1, Remote Gateway: 2.2.2.2
I've tried a couple of different ways to get this, but my code keeps stopping at the first match. I need the code to match one line, then move on to the following line and match that too:
with open('/path/to/vpn.txt', 'r') as file:
    for vpn in file:
        vpn = vpn.strip().lower()
        name = "xyz"
        if name in vpn:
            print(vpn)
            if "1.1.1.1" in vpn:
                print(vpn)
I'm able to print both lines if I move the second if to the same level as the first:
with open('/path/to/vpn.txt', 'r') as file:
    for vpn in file:
        vpn = vpn.strip().lower()
        name = "xyz"
        if name in vpn:
            print(vpn)
        if "1.1.1.1" in vpn:
            print(vpn)
Is it possible to match clauses on both lines?
I've tried a few different approaches with my indentation and matches but can't get it to work. Another problem with print(vpn) is that it prints the entire line.
Use a regex to match the regions you need and collect all matches from the entire text at once; you don't need to go line by line. An example:
import re
found_text = []
with open('/path/to/vpn.txt', 'r') as file:
    file_text = file.read()
for found in re.finditer(r"(?:VPN Name|Local Gateway|Remote Gateway):.*", file_text):
    # split on commas, if you want the fields split further
    found_text.extend(found.group(0).split(","))
print(found_text)
This will yield an output like
['VPN Name: VPN-NAME-XYZ', 'Local Gateway: 1.1.1.1', ' Remote Gateway: 2.2.2.2']
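If the goal is the single summary line shown in the question, one possible sketch is to capture each label together with its value and join the pairs. The names FIELD, summarize and sample are hypothetical, and the sample is an abbreviated copy of the question's text rather than a real file.

```python
import re

sample = """ID: 67108866 Virtual-system: root, VPN Name: VPN-NAME-XYZ
Local Gateway: 1.1.1.1, Remote Gateway: 2.2.2.2
Traffic Selector Name: TS-1
Version: IKEv2"""

# Each field value runs up to the next comma or the end of the line.
FIELD = re.compile(r"(VPN Name|Local Gateway|Remote Gateway):\s*([^,\n]+)")

def summarize(text):
    """Return the wanted fields as 'label: value' pairs joined by ', '."""
    pairs = FIELD.findall(text)  # list of (label, value) tuples
    return ", ".join(f"{label}: {value.strip()}" for label, value in pairs)

summary = summarize(sample)
print(summary)
# VPN Name: VPN-NAME-XYZ, Local Gateway: 1.1.1.1, Remote Gateway: 2.2.2.2
```

For a file containing many VPN entries this would join every entry into one line, so the text would first need to be split into per-entry blocks (for example on the ID: line).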
I have content that I am reading in that I need to collect the emails from within. However, I just want to pull the email that comes after From:
Here is an example:
Recip: fhavor#gmail.com
Subject: Report results (Gd)
Headers: Received: from daem.com (unknown [127.1.1.1])
Date: Sat, 13 Feb 2021 13:11:42 +0000 (GMT)
From: Tavon Lo <lt35#gmail.com>
As you can see, there are multiple emails, but I only want to collect the one that comes after the From: part of the content, which would be "lt35#gmail.com". So far I have a regex that collects ALL the emails within the content:
EMAIL = r"((?:^|\b)(?:[^\s]+?\#(?:.+?)\[\.\][a-zA-Z]+)(?:$|\b))"
I am new to regex patterns, so any ideas or suggestions on how to improve the above pattern to only collect the emails that come after From: would be highly appreciated!
You can use
(?m)^From:[^<>\n\r]*<([^<>#]+#[^<>]+)>
See the regex demo.
Details:
(?m) - re.M inline modifier option
^ - start of a line
From: - a literal string
[^<>\n\r]* - zero or more chars other than <, >, CR and LF
< - a < char
([^<>#]+#[^<>]+) - Group 1: one or more chars other than <, > and #, then a # char and then one or more chars other than < and >
> - a > char.
See a Python demo:
import re
rx = re.compile(r'^From:[^<>\n\r]*<([^<>#]+#[^<>]+)>', re.M) # Define the regex
with open(your_file_path, 'r') as f: # Open file for reading
    print(rx.findall(f.read())) # Get all the emails after From:
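As a quick check that does not need a file, the same pattern can be run against the sample text from the question. This is only a sketch: the variable names sample and senders are made up here, and the sample keeps # as the separator exactly as the question wrote it.

```python
import re

sample = """Recip: fhavor#gmail.com
Subject: Report results (Gd)
Headers: Received: from daem.com (unknown [127.1.1.1])
Date: Sat, 13 Feb 2021 13:11:42 +0000 (GMT)
From: Tavon Lo <lt35#gmail.com>"""

# Same pattern as above: anchored to lines that start with "From:".
rx = re.compile(r'^From:[^<>\n\r]*<([^<>#]+#[^<>]+)>', re.M)

senders = rx.findall(sample)
print(senders)  # ['lt35#gmail.com']
```

The Recip: address is skipped because its line does not start with From:.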
I have a very large netflow dataset that looks something like this:
192.168.1.3 www.123.com
192.168.1.6 api.123.com
192.168.1.3 blah.123.com
192.168.1.3 www.google.com
192.168.1.6 www.xyz.com
192.168.1.6 test.xyz.com
192.168.1.3 3.xyz.co.uk
192.168.1.3 www.blahxyzblah.com
....
I also have a much smaller dataset of wildcarded domains that look like this:
*.xyz.com
api.123.com
...
I'd like to be able to search my dataset and find all of the matches using python. So in the example above, I would match on:
192.168.1.6 www.xyz.com
192.168.1.6 test.xyz.com
192.168.1.6 api.123.com
I attempted to use the re module, but I cannot get it to match anything:
for f in offendingsites:
    for l in logs:
        if re.search(f,l):
            print(l)
The offending sites you have are not regexes, they are shell wildcards. However, you could use fnmatch.translate to convert them to regexes:
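import fnmatch
import re
for f in offendingsites:
    # translate() anchors the regex at the end of the string, so strip newlines first
    r = fnmatch.translate(f.strip())
    for l in logs:
        if re.search(r, l.strip()):
            print(l)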
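Since the dataset is described as very large, one possible refinement is to translate every wildcard once, join the results into a single compiled regex, and test only the domain column of each line. This is a sketch, not a benchmarked solution: the lists wildcards and logs stand in for the two files, and joining the translated patterns with | assumes a reasonably recent Python 3, where fnmatch.translate wraps its flags in a scoped group.

```python
import fnmatch
import re

wildcards = ["*.xyz.com", "api.123.com"]
logs = [
    "192.168.1.3 www.123.com",
    "192.168.1.6 api.123.com",
    "192.168.1.3 blah.123.com",
    "192.168.1.3 www.google.com",
    "192.168.1.6 www.xyz.com",
    "192.168.1.6 test.xyz.com",
    "192.168.1.3 3.xyz.co.uk",
    "192.168.1.3 www.blahxyzblah.com",
]

# Translate each wildcard once and join them into a single alternation.
combined = re.compile("|".join(fnmatch.translate(w) for w in wildcards))

# Test only the domain column (the last whitespace-separated field).
matches = [line for line in logs if combined.match(line.split()[-1])]

for line in matches:
    print(line)
```

Each log line is then scanned once, instead of once per wildcard.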
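You could also use fnmatch.fnmatch() to do the wildcard matching directly. It compares the pattern against the whole string, so test it against the domain column rather than the full log line.
Demo:
from fnmatch import fnmatch
with open("wildcards.txt") as offendingsites, open("dataset.txt") as logs:
    # Read the logs into a list so they can be scanned once per pattern
    log_lines = [l.strip() for l in logs]
    for f in offendingsites:
        f = f.strip() # Remove whitespace
        for l in log_lines:
            if l and fnmatch(l.split()[-1], f):
                print(l)
Output:
192.168.1.6 www.xyz.com
192.168.1.6 test.xyz.com
192.168.1.6 api.123.com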
I am trying to grab a hostname from configs, and sometimes there is a -p or -s appended to the hostname in the config that is not really part of the hostname.
So I wrote this regex to fetch the hostname from the config file:
REGEX_HOSTNAME = re.compile('^hostname\s(?P<hostname>(\w|\W)+?)(-p|-P|-s|-S)?$\n',re.MULTILINE)
hostname = REGEX_HOSTNAME.search(config).group('hostname').lower().strip()
This is a sample part of the config that I am using the regex on:
terminal width 120
hostname IGN-HSHST-HSH-01-P
domain-name sample.com
But in my result list of hostnames there is still the -P at the end.
ign-hshst-hsh-01-p
ign-hshst-hsh-02-p
ign-hshst-hsd-10
ign-hshst-hsh-01-S
ign-hshst-hsd-11
ign-hshst-hsh-02-s
In Regex 101 online tester it works and the -P is part of the last group. In my python (2.7) script it does not work.
Strange behavior is that when I use a slightly modified 2 pass regex it works:
REGEX_HOSTNAME = re.compile(r'^hostname\s*(?P<hostname>.*?)\n?$', re.MULTILINE)
REGEXP_CLUSTERNAME = re.compile('(?P<clustername>.*?)(?:-[ps])?$')
hostname = REGEX_HOSTNAME.search(config).group('hostname').lower().strip()
clustername = REGEXP_CLUSTERNAME.match(hostname).group('clustername')
Now Hostname has the full name and the clustername the one without the optional '-P' at the end.
You may use
import re
config=r"""terminal width 120
hostname IGN-HSHST-HSH-01-P
domain-name sample.com"""
REGEX_HOSTNAME = re.compile(r'^hostname\s*(.*?)(?:-[ps])?$', re.MULTILINE|re.I)
hostnames =[ h.lower().strip() for h in REGEX_HOSTNAME.findall(config) ]
print(hostnames) # => ['ign-hshst-hsh-01']
See the Python demo.
The ^hostname\s*(.*?)(?:-[ps])?$ regex matches:
^ - start of a line (due to re.MULTILINE, it matches a position after line breaks, too)
hostname - a word (case insensitive, due to re.I)
\s* - 0+ whitespaces
(.*?) - Group 1: zero or more chars other than line break chars, as few as possible
(?:-[ps])? - an optional occurrence of - and then p or s (case insensitive!)
$ - end of a line (due to re.MULTILINE).
See the regex demo online.
I have this string "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate." in a log file. What I need to do is look for this message and extract the IP address (1.2.3.4) from the log file.
import os
import shutil
import optparse
import sys
def main():
    file = open("messages", "r")
    log_data = file.read()
    file.close()
    search_str = "is currently trusted in the white list, but it is now using a new trusted certificate."
    index = log_data.find(search_str)
    print(index)
    return
if __name__ == '__main__':
    main()
How do I extract the IP address? Your response is appreciated.
Really simple answer:
msg = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
parts = msg.split(' ', 2)
print(parts[1])
results in:
1.2.3.4
You could also do REs if you wanted, but for something this simple...
There are dozens of possible approaches; the pros and cons depend on the details of your log file. One example, using the re module:
import re
x = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
pattern = r"IP ([0-9.]+) is currently trusted in the white list"
m = re.match(pattern, x)
for ip in m.groups():
    print(ip)
If you want to print out every instance of that string in your log file, use re.findall, since re.match only looks at the start of the text:
import re
pattern = r"(IP [0-9.]+ is currently trusted in the white list, but it is now using a new trusted certificate\.)"
for match in re.findall(pattern, log_data):
    print(match)
Use regular expressions.
Code like this:
import re
compiled = re.compile(r"""
.*? # Leading junk
(?P<ipaddress>\d+\.\d+\.\d+\.\d+) # IP address
.*? # Trailing junk
""", re.VERBOSE)
str = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
m = compiled.match(str)
print(m.group("ipaddress"))
And you get this:
>>> import re
>>>
>>> compiled = re.compile(r"""
... .*? # Leading junk
... (?P<ipaddress>\d+\.\d+\.\d+\.\d+) # IP address
... .*? # Trailing junk
... """, re.VERBOSE)
>>> str = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
>>> m = compiled.match(str)
>>> print(m.group("ipaddress"))
1.2.3.4
Also, I learned that there is a dictionary of matches, groupdict():
>>> str = "Peer 10.11.6.224 is currently trusted in the white list, but it is now using a new trusted certificate. Consider removing its likely outdated white list entry."
>>> m = compiled.match(str)
>>> print(m.groupdict())
{'ipaddress': '10.11.6.224'}
Later: fixed that. The initial '.*' was eating your first character match. Changed it to be non-greedy. For consistency (but not necessity), I changed the trailing match, too.
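An alternative that avoids the leading and trailing junk patterns altogether is to let the regex engine do the scanning with finditer. The sketch below assumes Python 3; the log excerpt and the names TRUSTED and ips are hypothetical.

```python
import re

# Hypothetical log excerpt; the second line is unrelated noise.
log_data = """IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate.
Service restarted at 10:15.
Peer 10.11.6.224 is currently trusted in the white list, but it is now using a new trusted certificate."""

# Anchor on the fixed message text so unrelated dotted numbers are ignored.
TRUSTED = re.compile(
    r"(?P<ipaddress>\d+\.\d+\.\d+\.\d+) is currently trusted in the white list"
)

# finditer scans the whole text, so no '.*?' prefix is needed.
ips = [m.group("ipaddress") for m in TRUSTED.finditer(log_data)]
print(ips)  # ['1.2.3.4', '10.11.6.224']
```

Because the pattern does not depend on the word before the address, it picks up both the "IP" and the "Peer" variants of the message.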
Regular expressions are the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format to a regular expression. In your case, you would do the following:
from stringparser import Parser
parser = Parser("IP {} is currently trusted in the white list, but it is now using a new trusted certificate.")
ip = parser(text)
If you have a file with multiple lines you can replace the last line by:
with open("log.txt", "r") as fp:
    ips = [parser(line) for line in fp]
Good luck.