Sorting the unique values from regex match in python - python

I am trying to parse a log file to extract email addresses.
I am able to match the email and print it with the help of regular expressions.
I noticed that there are a couple of duplicate emails in my log file. How can I remove the duplicates and print only the unique email addresses that match the pattern?
Here is the code I have written so far:
import sys
import re
file = open('/Users/me/Desktop/test.txt', 'r')
temp = []
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            print pattern.group()
        else:
            print "nono"
Here is my example log file that I am trying to parse:
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23
1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp
H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23
1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp
H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h => someuser#somedomain.com R=mail T=pop_mail_net
H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h => me#somedomain.com R=mail T=pop_mail_net
H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h => wo#somedomain.com R=mail T=pop_mail_net
H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h => lol#somedomain.com R=mail T=pop_mail_net
H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h Completed
Also, I am curious if I can improve my program or the regex. Any suggestion would be very helpful.
Thanks in advance.

As danidee said (he was first), a set will do the trick.
Try this:
from __future__ import print_function
import re
with open('test.txt') as f:
    data = f.read().splitlines()
emails = set(re.sub(r'^.*\s+(\w+\#[^\s]*?)\s+.*', r'\1', line) for line in data if '#' in line)
# a set is unordered, so sort it before printing to get the output below
print('\n'.join(sorted(emails))) if len(emails) else print('nono')
Output:
lol#somedomain.com
me#somedomain.com
someuser#somedomain.com
wo#somedomain.com
PS: you may want to use a proper email regex check, because I used a very primitive one.
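As a rough sketch of what a slightly stricter check could look like: the pattern below is a hypothetical, deliberately simple one (not a full RFC-style validator), and it assumes the addresses in the log use # in place of @, as in the sample. The names EMAIL_RE and extract_emails are made up for this illustration.

```python
import re

# Assumption: addresses use '#' where a real address would use '@'.
# EMAIL_RE is an illustrative pattern, not a complete email validator.
EMAIL_RE = re.compile(r'[\w.+-]+#[\w-]+(?:\.[\w-]+)+')

def extract_emails(lines):
    """Return the sorted, unique addresses found in an iterable of lines."""
    found = set()
    for line in lines:
        found.update(EMAIL_RE.findall(line))
    return sorted(found)

sample = [
    '1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp',
    '1Wuniq-mail-idSm-1h => me#somedomain.com R=mail T=pop_mail_net',
    '1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp',
    '1Wuniq-mail-idSm-1h Completed',
]
print(extract_emails(sample))
# ['me#somedomain.com', 'someuser#somedomain.com']
```

Using findall on each line avoids the re.sub trick, so lines without an address simply contribute nothing.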

Some of the duplicates are due to a bug in your code where you do not reset temp when processing each line. A line that does not contain either -> or => and which is preceded by a line that does contain either of those strings will trigger the if temp: test, and output the email address from the previous line if there was one.
That can be fixed by jumping back to the start of the loop with continue when the line contains neither -> nor =>.
For the other genuine duplicates that occur because the same email address appears in multiple lines, you can filter those out with a set.
import sys
import re
addresses = set()
pattern = re.compile(r'^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?')
with open('/Users/me/Desktop/test.txt', 'r') as f:
    for line in f:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        else:
            # neither '=>' nor '->' present in the line
            continue
        match = pattern.match(temp[1])
        if match is not None:
            # strip the spaces that the pattern matches around the address
            addresses.add(match.group().strip())
        else:
            print("nono")
for address in sorted(addresses):
    print(address)
The addresses are stored in a set to remove duplicates. Then they are sorted and printed. Note also the use of the with statement to open the file within a context manager. This guarantees that the file will always be closed.
Also, as you will be applying the same regex pattern many times, it is worth compiling it ahead of time for better efficiency.
With a properly written regex pattern your code can be greatly simplified:
import re
addresses = set()
pattern = re.compile(r'[-=]> +(\w{1,}#\w{1,}\.\w{2,3})')
with open('test.txt', 'r') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            addresses.add(match.groups()[0])
for address in sorted(addresses):
    print(address)

You can use a set to keep track of the addresses you have already printed. Each time a line matches, print the address only if it is not yet in the set, then add it:
import sys
import re
file = open('/Users/me/Desktop/test.txt', 'r')
temp = []
seen = set()
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            matched = pattern.group()
            if matched not in seen:
                # remember the address so that later duplicates are skipped
                seen.add(matched)
                print matched
        else:
            print "nono"
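The seen-set idea can also be sketched as a small generator that keeps the addresses in first-seen order. This is only an illustration: PATTERN and unique_in_order are hypothetical names, and the pattern assumes the address follows -> or => as in the sample log.

```python
import re

# Hypothetical pattern: the address follows '->' or '=>' and uses '#'.
PATTERN = re.compile(r'[-=]> +(\w+#\w+\.\w{2,3})')

def unique_in_order(lines):
    """Yield each matched address the first time it appears."""
    seen = set()
    for line in lines:
        match = PATTERN.search(line)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))  # record it so duplicates are skipped
            yield match.group(1)

sample = [
    '1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp',
    '1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp',
    '1Wuniq-mail-idSm-1h => me#somedomain.com R=mail T=pop_mail_net',
    '1Wuniq-mail-idSm-1h Completed',
]
print(list(unique_in_order(sample)))
# ['someuser#somedomain.com', 'me#somedomain.com']
```

Unlike sorted(set(...)), this preserves the order in which addresses occur in the log.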

Related

How to re-sort a list/text with Python?

My bot reads another bot's message, then temporarily saves that message, makes a few changes with .replace and then the bot is supposed to change the format of the entries it finds.
I have tried quite a few things, but have not figured it out.
The text looks like this:
06 6 452872995438985XXX
09 22 160462182344032XXX
11 17 302885091519234XXX
And I want to get the following format:
6/06 452872995438985XXX
22/09 160462182344032XXX
17/11 302885091519234XXX
I have already tried the following things:
splitsprint = test.split(' ') # test is in this case the string we use e.g. the text shown above
for x in splitsprint:
    month, day, misc = x
    print(f"{day}/{month} {misc}")
---
newline = test.split('\n')
for line in newline:
    month, day, misc = line.split(' ')
    print(f"{day}/{month} {misc}")
But I always got a ValueError: too many values to unpack (expected 3) error or similar.
Does anyone here see my error?
I'm guessing it's because of the trailing whitespace in the input. Use strip:
s = '''
06 6 452872995438985XXX
09 22 160462182344032XXX
11 17 302885091519234XXX
'''
lines = s.strip().split('\n')
tokens = [l.split(' ') for l in lines]
final = [f'{day}/{month} {misc}' for month, day, misc in tokens]
for f in final:
    print(f)
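A slightly more defensive variant, as a sketch: calling split() with no argument splits on any run of whitespace and ignores leading and trailing whitespace, so blank lines and stray spaces do not break the unpacking. The function name swap_day_month and the sample text are made up for this illustration.

```python
def swap_day_month(text):
    """Turn 'MM D misc' lines into 'D/MM misc', skipping blank or malformed lines."""
    out = []
    for line in text.splitlines():
        parts = line.split()  # no argument: split on any whitespace run
        if len(parts) != 3:
            continue  # blank line or unexpected shape
        month, day, misc = parts
        out.append(f"{day}/{month} {misc}")
    return out

# Hypothetical input with a leading newline, trailing spaces and a blank line.
text = "\n06 6 452872995438985XXX  \n09 22 160462182344032XXX\n\n11 17 302885091519234XXX\n"
for entry in swap_day_month(text):
    print(entry)
# 6/06 452872995438985XXX
# 22/09 160462182344032XXX
# 17/11 302885091519234XXX
```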

Splitting re.Match object type by words

I have this code that scans my emails and returns an important identifying number and its corresponding date. I am trying to split the number and the date into separate substrings separated by columns (the ultimate plan is to get all the info into a csv), but I get the following error: AttributeError: 're.Match' object has no attribute 'split'. Any help is appreciated. Here's my code:
pattern = re.compile(r'[a-zA-Z]+[0-9]+ [0-9]+/[0-9]+/[0-9]+')
matches = pattern.finditer(body)
for match in matches:
    matches.split()
I expect the output to look like the following:
AAA111111, 1/1/2022
BBB222222, 1/1/2022
and so on. Goal is to turn it into a csv that I can import elsewhere
Also, here is what goes into body:
''
Thanks for following up. Here’s an update on your orders.
PUU128377 5/22/2023
PUN102938 11/1/2024
PUU012938 10/01/2025
Reach out with any further questions
''
New email with extended info
PUU128377 Line 20 Seq 1 5/22/2023
PUN102938 Line 100 Seq 8 11/1/2024
PUU012938 Line 120 Seq 4 1/1/2025
Try:
import re
import csv
body = """\
PUU128377 Line 20 Seq 1 5/22/2023
PUN102938 Line 100 Seq 8 11/1/2024
PUU012938 Line 120 Seq 4 1/1/2025
"""
pattern = re.compile(
    r"([a-zA-Z]+[0-9]+) Line ([0-9]+) Seq ([0-9]+) ([0-9]+/[0-9]+/[0-9]+)"
)
matches = pattern.finditer(body)
# newline="" is what the csv module documentation recommends for csv.writer files
with open("data.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerows(map(lambda m: m.groups(), matches))
This creates data.csv file:
PUU128377,20,1,5/22/2023
PUN102938,100,8,11/1/2024
PUU012938,120,4,1/1/2025
Edit: Updated answer with new input

Python: How to print element from an array that starts with a specific number using Regex?

I have a file with some data:
Dave Martin
615-555-7164
173 Main St., Springfield RI 55924
davemartin#bogusemail.com
Charles Harris
800-555-5669
969 High St., Atlantis VA 34075
charlesharris#bogusemail.com
Eric Williams
560-555-5153
806 1st St., Faketown AK 86847
laurawilliams#bogusemail.com
Next, I read the lines of the file and want to store all elements of that list that start with my specific number, into a new list.
from os import sep
import re
number = int(input("Number: "))
last_digit = number % 10
final = str(last_digit)
results = []
with open("data.txt") as f:
    lines = f.readlines()
desired = lines[2::5]
results = re.findall('^[4][0-9]{2}$', desired)
print(results)
I don't know how to put my number in Regex, for now it's 4...
But I always get this error:
TypeError: expected string or bytes-like object
SOLVED
from os import sep
import re
number = int(input("Number: "))
last_digit = number % 10
final = str(last_digit)
results = []
with open("data.txt") as f:
    lines = f.readlines()
desired = lines[2::5]
r = re.compile("[" + re.escape(final) + "][0-9]{2}.*")
newlist = list(filter(r.match, desired))
print("********************")
print(*newlist)
The error occurs because you pass a list to the regex function, which expects a string or bytes-like object.
Joining the list with "\n".join(desired) removes the error, but the regex then has to be run with the re.M flag so that ^ and $ match at the start and end of each line.
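A minimal sketch of the join-plus-re.M approach follows. The list desired here is a hypothetical stand-in for lines[2::5] as returned by readlines() (so each entry still ends with a newline), and final is a made-up leading digit.

```python
import re

# Hypothetical stand-in for lines[2::5] read from data.txt.
desired = [
    "173 Main St., Springfield RI 55924\n",
    "969 High St., Atlantis VA 34075\n",
    "806 1st St., Faketown AK 86847\n",
]
final = "9"  # hypothetical digit chosen by the user

# Strip the newlines kept by readlines(), then join into one string.
text = "\n".join(line.rstrip("\n") for line in desired)

# re.M makes ^ and $ match at the start and end of every line.
pattern = re.compile("^" + re.escape(final) + r"[0-9]{2}.*$", re.M)
matches = pattern.findall(text)
print(matches)
# ['969 High St., Atlantis VA 34075']
```

Filtering the list with a compiled pattern, as in the solved version, works just as well; the joined form is mainly useful when a single findall over the whole text is wanted.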

Python print both the matching groups in regex

I want to find two fixed patterns in a log file. Here is what a line in the log file looks like:
passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362
From this log, I want to extract drmanhattan and 362 which is a number just before the line ends.
Here is what I have tried so far.
import sys
import re
with open("Xavier.txt") as f:
    for line in f:
        match1 = re.search(r'((\w+_\w+)|(\d+$))',line)
        if match1:
            print match1.groups()
However, every time I run this script, I always get drmanhattan as output and not drmanhattan 362.
Is it because of the | sign?
How do I tell the regex to catch both this group and that group?
I have already consulted two related links; however, they did not solve my problem.
line = 'Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 362'
match1 = re.search(r'(\w+_\w+).*?(\d+$)', line)
if match1:
    print match1.groups()
# ('drmanhattan_resources', '362')
If you have a test.txt file that contains the following lines:
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 362
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 363
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 361
you can do:
with open('test.txt', 'r') as fil:
    for line in fil:
        match1 = re.search(r'(\w+_\w+).*?(\d+)\s*$', line)
        if match1:
            print match1.groups()
# ('drmanhattan_resources', '362')
# ('drmanhattan_resources', '363')
# ('drmanhattan_resources', '361')
| means OR, so your regex matches either (\w+_\w+) or (\d+$), not both.
Maybe you want something like this:
((\w+_\w+).*?(\d+$))
With re.search you only get the first match, if any, and with | you tell re to look for either this or that pattern. As suggested in other answers, you could replace the | with .* to match "anything in between" those two patterns. Alternatively, you could use re.findall to get all matches:
>>> line = "passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362"
>>> re.findall(r'\w+_\w+|\d+$', line)
['drmanhattan_resources', '362']

How can I search a range of lines in python?

I would like to search through a range of lines in a date-ordered log file between two dates. If I were at the command line, sed would come in handy:
sed -rn '/03.Nov.2012/,/12.Oct.2013/s/search key/search key/p' my.log
The above would only display lines between the 3 November, 2012 and 12 of October, 2013 that contain the string "search key".
Is there a lightweight way I can do this in Python?
I could build a single RE for the above , but it would be nightmarish.
The best I can come up with is this:
#!/usr/bin/python
start_date = "03/Nov/2012"
end_date = "12/Oct/2013"
search_key = "search key"  # the string to look for
start = False
try:
    with open("my.log",'r') as log:
        for line in log:
            if start:
                if end_date in line:
                    break
            else:
                if start_date in line:
                    start = True
                else:
                    continue
            if search_key in line:
                print line
except IOError, e:
    print '<p>Log file not found.'
But this strikes me as not 'pythonic'.
One can assume that search date limits will be found in the log file.
Using itertools and a generator is one way:
from itertools import takewhile, dropwhile
with open('logfile') as fin:
    # literal substrings as they appear in the log (the dots in the sed command are regex wildcards)
    start = dropwhile(lambda L: '03/Nov/2012' not in L, fin)
    until = takewhile(lambda L: '12/Oct/2013' not in L, start)
    query = (line for line in until if 'search string' in line)
    for line in query:
        pass # do something
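To see how dropwhile and takewhile carve out the range, here is a self-contained sketch on an in-memory list. The log lines and the helper name lines_between are hypothetical; the real log format may differ. Note that the line containing the end marker is excluded, as it is by the break in the question's loop.

```python
from itertools import dropwhile, takewhile

def lines_between(lines, start_marker, end_marker, key):
    """Return lines containing key, from the first line with start_marker
    up to (but not including) the first later line with end_marker."""
    started = dropwhile(lambda line: start_marker not in line, lines)
    window = takewhile(lambda line: end_marker not in line, started)
    return [line for line in window if key in line]

# Hypothetical log lines for illustration only.
log = [
    "01/Nov/2012 search key too early",
    "03/Nov/2012 search key first hit",
    "04/Nov/2012 something else",
    "11/Oct/2013 search key second hit",
    "12/Oct/2013 search key at end marker",
    "13/Oct/2013 search key too late",
]
result = lines_between(log, "03/Nov/2012", "12/Oct/2013", "search key")
print(result)
# ['03/Nov/2012 search key first hit', '11/Oct/2013 search key second hit']
```

Because both itertools functions are lazy, the same helper works on an open file object without reading the whole log into memory first.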
