Custom regex pattern for matching email addresses

I am reading in content and need to collect email addresses from it. However, I only want to pull the email that comes after From:
Here is an example:
Recip: fhavor#gmail.com
Subject: Report results (Gd)
Headers: Received: from daem.com (unknown [127.1.1.1])
Date: Sat, 13 Feb 2021 13:11:42 +0000 (GMT)
From: Tavon Lo <lt35#gmail.com>
As you can see, there are multiple emails, but I only want to collect the one that comes after the From: part of the content, which would be "lt35#gmail.com". So far I have a regex that collects ALL the emails within the content:
EMAIL = r"((?:^|\b)(?:[^\s]+?\#(?:.+?)\[\.\][a-zA-Z]+)(?:$|\b))"
I am new to regex patterns, so any ideas or suggestions on how to improve the above pattern to collect only the email that comes after From: would be highly appreciated!

You can use
(?m)^From:[^<>\n\r]*<([^<>#]+#[^<>]+)>
Details:
(?m) - re.M inline modifier option
^ - start of a line
From: - a literal string
[^<>\n\r]* - zero or more chars other than <, >, CR and LF
< - a < char
([^<>#]+#[^<>]+) - Group 1: one or more chars other than <, > and #, then a # char and then one or more chars other than < and >
> - a > char.
In Python:
import re
rx = re.compile(r'^From:[^<>\n\r]*<([^<>#]+#[^<>]+)>', re.M)  # Define the regex
with open(your_file_path, 'r') as f:  # Open file for reading
    print(rx.findall(f.read()))  # Get all the emails after From:
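To try the pattern without a file, here is a minimal sketch that runs it against the sample text from the question. The names `sample`, `FROM_RX` and `from_addresses` are made up for this illustration.

```python
import re

# Sample content copied from the question (addresses use '#' instead of '@').
sample = """Recip: fhavor#gmail.com
Subject: Report results (Gd)
Headers: Received: from daem.com (unknown [127.1.1.1])
Date: Sat, 13 Feb 2021 13:11:42 +0000 (GMT)
From: Tavon Lo <lt35#gmail.com>"""

# Same pattern as above: only lines starting with "From:" are considered.
FROM_RX = re.compile(r'^From:[^<>\n\r]*<([^<>#]+#[^<>]+)>', re.M)

def from_addresses(text):
    """Return every address found in angle brackets on a From: line."""
    return FROM_RX.findall(text)

print(from_addresses(sample))  # ['lt35#gmail.com']
```

The Recip: address is skipped because the match is anchored to lines that begin with From:.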

Related

python re match ip address

I have a simple script for combing through IP addresses. I'd like to extract the IP with a regex from the following output:
Starting Nmap 7.91 ( https://nmap.org ) at 2020-12-11 15:04 EST
Nmap scan report for host.com (127.0.0.1)
Host is up (0.14s latency).
I tried using this tool: https://pythex.org/. I was able to get a match with the following pattern:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
However, this code returns no match:
regex = re.match("(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})",output)
print(regex)
The expected output is 127.0.0.1. Any help with this would be greatly appreciated.

re.match matches a pattern at the beginning of the given string. It looks like what you want is re.findall or re.search:
import re
output = '''
Starting Nmap 7.91 ( https://nmap.org ) at 2020-12-11 15:04 EST
Nmap scan report for host.com (127.0.0.1)
Host is up (0.14s latency).'''
regex = re.findall(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", output)
print(regex) # ['127.0.0.1']
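The difference between the three functions can be seen side by side in a short sketch. It reuses the Nmap output from the question; the name `IP_RX` is just illustrative.

```python
import re

# Four dot-separated groups of 1-3 digits (no range check on each octet).
IP_RX = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

output = """Starting Nmap 7.91 ( https://nmap.org ) at 2020-12-11 15:04 EST
Nmap scan report for host.com (127.0.0.1)
Host is up (0.14s latency)."""

# match() only tries at position 0, where the text starts with "Starting".
print(IP_RX.match(output))           # None

# search() scans forward and returns the first match anywhere in the string.
print(IP_RX.search(output).group())  # 127.0.0.1

# findall() returns every non-overlapping match as a list of strings.
print(IP_RX.findall(output))         # ['127.0.0.1']
```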

You should use re.search
>>> import re
>>> text = 'Nmap scan report for host.com (127.0.0.1)'
>>> re.search(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", text).group()
'127.0.0.1'

To capture only the address inside the parentheses, use re.search with a capturing group (re.match would fail here, because it only matches at the start of the string):
output = 'Nmap scan report for host.com (127.0.0.1)'
ip = re.search(r'\(([\d.]+)\)', output).group(1)
Explanation: the pattern matches a pair of parentheses that contains only digits and dots. group(1) then returns the text captured by the first group, i.e. the address without the parentheses.

Regex to match ERROR or INFO, but with different actions depending on the match

I am iterating through a log file to list how many ERROR messages and INFO messages each user has generated. I am trying to do it with one regex pattern and add the captured line to a different list depending on whether the line is an ERROR or an INFO message.
The contents of the log file look like this:
Jan 31 00:21:30 ubuntu.local ticky: ERROR The ticket was modified while updating (john123)
Jan 31 00:44:34 ubuntu.local ticky: INFO Closed ticket [#1754] (jack456)

First, I want to point out that it would likely be far easier to do this with simple if statements, checking whether the substrings are in the log line:
for line in log_contents:
    if "ERROR" in line:
        pass  # do error actions
    elif "INFO" in line:
        pass  # do info actions
If you are determined to use regex, something like this should work. I'd suggest looking at the docs for Python's re module for more information on using regex in Python:
import re
# read log file into `log_contents`
pattern = re.compile(r"(ERROR|INFO)")
for line in log_contents:
    match = pattern.search(line)
    if match is not None:
        if match.group(0) == "ERROR":
            pass  # do error actions
        elif match.group(0) == "INFO":
            pass  # do info actions
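Since the question asks for per-user counts, one possible way to fill in the "actions" is sketched below. It assumes every relevant line has the shape shown in the question, with the username in parentheses at the end; `LINE_RX` and `counts` are names chosen for this example.

```python
import re
from collections import Counter

# Assumed line shape: "... ticky: LEVEL message text (username)"
LINE_RX = re.compile(r"ticky: (?P<level>ERROR|INFO) .*\((?P<user>[^()]+)\)\s*$")

# The two sample lines from the question.
log_contents = [
    "Jan 31 00:21:30 ubuntu.local ticky: ERROR The ticket was modified while updating (john123)",
    "Jan 31 00:44:34 ubuntu.local ticky: INFO Closed ticket [#1754] (jack456)",
]

# One Counter per level, keyed by username.
counts = {"ERROR": Counter(), "INFO": Counter()}
for line in log_contents:
    match = LINE_RX.search(line)
    if match:
        counts[match.group("level")][match.group("user")] += 1

print(dict(counts["ERROR"]))  # {'john123': 1}
print(dict(counts["INFO"]))   # {'jack456': 1}
```

Using the captured level as a dictionary key avoids the if/elif chain entirely.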

Regex not working in python, but in online regex tools

I am trying to grab a hostname from configs, and sometimes a -p or -s is appended to the hostname in the config that is not really part of the hostname.
So I wrote this regex to fetch the hostname from the config file:
REGEX_HOSTNAME = re.compile('^hostname\s(?P<hostname>(\w|\W)+?)(-p|-P|-s|-S)?$\n',re.MULTILINE)
hostname = REGEX_HOSTNAME.search(config).group('hostname').lower().strip()
This is a sample part of the config that I am using the regex on:
terminal width 120
hostname IGN-HSHST-HSH-01-P
domain-name sample.com
But in my result list of hostnames there is still the -P at the end.
ign-hshst-hsh-01-p
ign-hshst-hsh-02-p
ign-hshst-hsd-10
ign-hshst-hsh-01-S
ign-hshst-hsd-11
ign-hshst-hsh-02-s
In the Regex 101 online tester it works and the -P is part of the last group. In my Python (2.7) script it does not work.
Strangely, a slightly modified two-pass regex does work:
REGEX_HOSTNAME = re.compile(r'^hostname\s*(?P<hostname>.*?)\n?$', re.MULTILINE)
REGEXP_CLUSTERNAME = re.compile('(?P<clustername>.*?)(?:-[ps])?$')
hostname = REGEX_HOSTNAME.search(config).group('hostname').lower().strip()
clustername = REGEXP_CLUSTERNAME.match(hostname).group('clustername')
Now hostname holds the full name, and clustername holds the name without the optional '-P' at the end.

You may use
import re
config=r"""terminal width 120
hostname IGN-HSHST-HSH-01-P
domain-name sample.com"""
REGEX_HOSTNAME = re.compile(r'^hostname\s*(.*?)(?:-[ps])?$', re.MULTILINE|re.I)
hostnames = [h.lower().strip() for h in REGEX_HOSTNAME.findall(config)]
print(hostnames) # => ['ign-hshst-hsh-01']
The ^hostname\s*(.*?)(?:-[ps])?$ regex matches:
^ - start of a line (due to re.MULTILINE, it matches a position after line breaks, too)
hostname - a word (case insensitive, due to re.I)
\s* - 0+ whitespaces
(.*?) - Group 1: zero or more chars other than line break chars, as few as possible
(?:-[ps])? - an optional occurrence of - and then p or s (case insensitive!)
$ - end of a line (due to re.MULTILINE).
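A variant of this pattern that also tolerates trailing spaces, tabs or carriage returns is sketched below. It assumes the config might contain Windows-style line endings, which is only a guess at why the original pattern behaved differently in the script than in the online tester. `HOSTNAME_RX` and `cluster_names` are illustrative names.

```python
import re

# Same idea as the answer's pattern, plus optional trailing spaces, tabs or
# carriage returns before the end of the line (an assumption about the input).
HOSTNAME_RX = re.compile(
    r'^hostname\s+(.*?)(?:-[ps])?[ \t\r]*$',
    re.MULTILINE | re.IGNORECASE,
)

def cluster_names(config):
    """Return lower-cased hostnames with an optional -p/-s suffix removed."""
    return [name.lower() for name in HOSTNAME_RX.findall(config)]

# Sample config from the question, here with \r\n line endings.
config = (
    "terminal width 120\r\n"
    "hostname IGN-HSHST-HSH-01-P\r\n"
    "domain-name sample.com\r\n"
)

print(cluster_names(config))                         # ['ign-hshst-hsh-01']
print(cluster_names("hostname ign-hshst-hsd-10\n"))  # ['ign-hshst-hsd-10']
```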

I need to find all instances of a phrase in a file and then print a sorted list of the next word after that phrase

I am trying to write a program that finds every occurrence of the phrase "Invalid user", extracts the username from each such entry, and prints the usernames sorted alphabetically. Unfortunately, every time I run the program it prints the entire line and not just the usernames. This is an example of the file that I want to search through:
May 26 06:25:01 instance-1 CRON[19549]: pam_unix(cron:session): session closed for user root
May 26 06:38:14 instance-1 sshd[19783]: Connection closed by 210.187.175.103 port 60536 [preauth]
May 26 06:39:05 instance-1 sshd[19797]: Invalid user backups from 182.254.146.167 port 58682
May 26 06:39:05 instance-1 sshd[19797]: input_userauth_request: invalid user backups [preauth]
In the third line, you can see "Invalid user" followed by "backups" which is the user name I am wanting to have printed at the end. Here is the code that I have been working with, but it prints off "Invalid user:" and then the entire line after that:
invalid_users = []
substr = "invalid user".lower()
with open('auth.log', 'rt') as myfile:
    for line in myfile:
        if line.lower().find(substr) != +1:
            invalid_users.append("Invalid user" + ": "+ line.lstrip("/n"))
for users in invalid_users:
    print(users)
I wish I was good at Python, but I am very inexperienced with it, and have not had much luck learning it yet. Any help would be appreciated.

This kind of problem - pattern matching in text - can be solved using regular expressions (implemented in the re module in Python's standard library).
Let's say we have the lines from the question collected in a list:
>>> print(lines)
['May 26 06:25:01 instance-1 CRON[19549]: pam_unix(cron:session): session closed for user root', 'May 26 06:38:14 instance-1 sshd[19783]: Connection closed by 210.187.175.103 port 60536 [preauth]', 'May 26 06:39:05 instance-1 sshd[19797]: Invalid user backups from 182.254.146.167 port 58682', 'May 26 06:39:05 instance-1 sshd[19797]: input_userauth_request: invalid user backups [preauth]']
Let's import the re module and define a pattern to be matched
>>> import re
>>> pattern = r'[Ii]nvalid user\s+(\w+)\s+'
The pattern to be matched is:
either an "I" or an "i", followed by "nvalid user"
followed by at least one whitespace character
followed by a subpattern in parentheses (a capturing group): at least one word character (the username in this case)
followed by at least one whitespace character
Now we search for the pattern in each line:
>>> matches = [re.search(pattern, line) for line in lines]
What do we find?
>>> matches
[None,
None,
<re.Match object; span=(40, 61), match='Invalid user backups '>,
<re.Match object; span=(64, 85), match='invalid user backups '>]
Let's loop through the good matches and print the match for the subpattern:
>>> for match in filter(None, matches):
...     print(match.group(1))
...
backups
backups
Putting it all together, it might look like this:
import re
pattern = r'[Ii]nvalid user\s+(\w+)\s+'
with open('auth.log', 'rt') as myfile:
    matches = [re.search(pattern, line) for line in myfile]
# Match objects cannot be compared, so sort the captured usernames instead
for user in sorted(match.group(1) for match in filter(None, matches)):
    print(user)
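If each username should appear only once, a set can collect them before sorting. The sketch below works on an in-memory list instead of auth.log; the last sample line (user "admin") is hypothetical and was added only to make the sorting visible. Note that `\w+` would stop at a hyphen or dot inside a username.

```python
import re

# Case-insensitive, so it covers both "Invalid user" and "invalid user".
INVALID_USER_RX = re.compile(r"invalid user\s+(\w+)", re.IGNORECASE)

def invalid_usernames(lines):
    """Return the distinct usernames after 'invalid user', sorted alphabetically."""
    users = set()
    for line in lines:
        match = INVALID_USER_RX.search(line)
        if match:
            users.add(match.group(1))
    return sorted(users)

sample_lines = [
    "May 26 06:25:01 instance-1 CRON[19549]: pam_unix(cron:session): session closed for user root",
    "May 26 06:39:05 instance-1 sshd[19797]: Invalid user backups from 182.254.146.167 port 58682",
    "May 26 06:39:05 instance-1 sshd[19797]: input_userauth_request: invalid user backups [preauth]",
    # Hypothetical extra line, not from the original log excerpt:
    "May 26 06:40:11 instance-1 sshd[19801]: Invalid user admin from 182.254.146.167 port 58690",
]

print(invalid_usernames(sample_lines))  # ['admin', 'backups']
```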

How to extract a word from text in Python

I have this string "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate." in a log file. What I need to do is look for this message and extract the IP address (1.2.3.4) from the log file.
import os
import shutil
import optparse
import sys
def main():
    file = open("messages", "r")
    log_data = file.read()
    file.close()
    search_str = "is currently trusted in the white list, but it is now using a new trusted certificate."
    index = log_data.find(search_str)
    print(index)
    return
if __name__ == '__main__':
    main()
How do I extract the IP address? Your response is appreciated.

Really simple answer:
msg = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
parts = msg.split(' ', 2)
print(parts[1])
results in:
1.2.3.4
You could also do REs if you wanted, but for something this simple...

There are dozens of possible approaches; the pros and cons depend on the details of your log file. One example, using the re module:
import re
x = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
pattern = r"IP ([0-9.]+) is currently trusted in the white list"
m = re.match(pattern, x)
for ip in m.groups():
    print(ip)
If you want to print out every instance of that string in your log file, use re.findall, since re.match only matches at the start of the string:
import re
pattern = r"(IP [0-9.]+ is currently trusted in the white list, but it is now using a new trusted certificate\.)"
for match in re.findall(pattern, log_data):
    print(match)

Use regular expressions.
Code like this:
import re
compiled = re.compile(r"""
.*? # Leading junk
(?P<ipaddress>\d+\.\d+\.\d+\.\d+) # IP address
.*? # Trailing junk
""", re.VERBOSE)
text = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
m = compiled.match(text)
print(m.group("ipaddress"))
And you get this:
>>> import re
>>>
>>> compiled = re.compile(r"""
... .*? # Leading junk
... (?P<ipaddress>\d+\.\d+\.\d+\.\d+) # IP address
... .*? # Trailing junk
... """, re.VERBOSE)
>>> text = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
>>> m = compiled.match(text)
>>> print(m.group("ipaddress"))
1.2.3.4
Also, I learned that there is a dictionary of matches, groupdict():
>>> text = "Peer 10.11.6.224 is currently trusted in the white list, but it is now using a new trusted certificate. Consider removing its likely outdated white list entry."
>>> m = compiled.match(text)
>>> print(m.groupdict())
{'ipaddress': '10.11.6.224'}
{'ipaddress': '10.11.6.224'}
Later: fixed that. The initial greedy '.*' was eating the leading digits of the IP address, so I changed it to be non-greedy. For consistency (but not necessity), I changed the trailing match, too.
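To pull every such address out of a whole log rather than a single message, re.finditer with a named group is one option. The sketch below assumes each message starts with either "IP" or "Peer", as in the two examples above; `TRUSTED_RX` and the inline `log_data` are illustrative.

```python
import re

# Assumes the address is preceded by "IP" or "Peer" and followed by the fixed
# phrase; each part of the address is limited to 1-3 digits (no range check).
TRUSTED_RX = re.compile(
    r"\b(?:IP|Peer) (?P<ipaddress>\d{1,3}(?:\.\d{1,3}){3}) "
    r"is currently trusted in the white list"
)

# Stand-in for the contents of the log file.
log_data = """IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate.
Some unrelated log line mentioning 127.0.0.1
Peer 10.11.6.224 is currently trusted in the white list, but it is now using a new trusted certificate. Consider removing its likely outdated white list entry."""

# finditer yields one match object per occurrence, anywhere in the text.
ips = [m.group("ipaddress") for m in TRUSTED_RX.finditer(log_data)]
print(ips)  # ['1.2.3.4', '10.11.6.224']
```

The unrelated line is ignored because its address is not followed by the trusted-list phrase.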

Regular expressions are the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format into a regular expression. In your case, you would do the following:
from stringparser import Parser
parser = Parser("IP {} is currently trusted in the white list, but it is now using a new trusted certificate.")
ip = parser(text)
If you have a file with multiple lines, you can replace the last line with:
with open("log.txt", "r") as fp:
    ips = [parser(line) for line in fp]
Good luck.
