Python: how to extract string from file - only once

I have the below output from a router stored in a file:
-#- --length-- -----date/time------ path
3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
4 1896 Sep 27 2019 14:22:08 +05:30 taas/NN41_R11_Golden_Config
5 1876 Nov 27 2017 20:07:50 +05:30 taas/nfast_default.cfg
I want to search for the substring 'Golden_Image' in the file & get the complete path. So here, the required output would be this string:
taas/NN41_R11_Golden_Image
First attempt:
import re
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" in line:
            print(line)
Output:
3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
Second attempt:
import re
hand = open('outlog.out')
for line in hand:
    line = line.rstrip()
    x = re.findall('.*?Golden_Image.*?', line)
    if len(x) > 0:
        print x
Output:
['3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image']
Neither of these give the required output. How can I fix this?

This is actually surprisingly fiddly to do if the path can contain spaces.
You need to use the maxsplit argument to split to identify the path field.
with open("outlog.out") as f:
    for line in f:
        field = line.split(None, 7)   # at most 7 splits -> 8 fields; the path keeps its spaces
        if len(field) == 8 and "Golden_Image" in field[7]:
            print(field[7].strip())
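A quick illustration of what maxsplit gives you on the sample line (just a sketch; the string below is copied from the question):
line = "3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image"
fields = line.split(None, 7)   # splits on runs of whitespace, at most 7 times
print(fields[7])               # taas/NN41_R11_Golden_Image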

Split the line and check whether the "Golden_Image" string exists in any of the split parts.
import re
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" not in line:
            continue
        print re.search(r'\S*Golden_Image\S*', line).group()
or
images = re.findall(r'\S*Golden_Image\S*', open("outlog.out").read())
Example:
>>> s = '''
-#- --length-- -----date/time------ path
3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
4 1896 Sep 27 2019 14:22:08 +05:30 taas/NN41_R11_Golden_Config
5 1876 Nov 27 2017 20:07:50 +05:30 taas/nfast_default.cfg'''.splitlines()
>>> for line in s:
...     for i in line.split():
...         if "Golden_Image" in i:
...             print i
...
taas/NN41_R11_Golden_Image
>>>

Reading the full content at once and then searching it is not efficient. Instead, the file can be read line by line; if a line matches the criterion, the path can be extracted with a regex, without any further splitting.
Use the following regex to get the path:
\s+(?=\S*$).*
Link: https://regex101.com/r/zuH0Zv/1
Here is working code:
import re
regex = r"\s+(?=\S*$).*"
test_str = "3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image"
matches = re.search(regex, test_str)
print(matches.group().strip())
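A sketch of the same idea applied to the file itself, line by line (assuming the file is named outlog.out, as in the question):
import re

regex = r"\s+(?=\S*$).*"
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" in line:                # only lines of interest
            match = re.search(regex, line.rstrip())
            if match:
                print(match.group().strip())      # taas/NN41_R11_Golden_Image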

Following your code, if you just want to get the right output, you can do it more simply:
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" in line:
            print(line.split()[-1])   # split() also drops the trailing newline
The output is:
taas/NN41_R11_Golden_Image
PS: if you want to do more complex operations, you may need to try the re module, as @Avinash Raj answered.

Related

Extracting Date-stamp from an email subject

I am trying a very simple extraction (it can be anything), but I am not able to figure out what the problem is. In the code below I am trying to extract the hour from the date stamp in the email body.
txt = 'From stephen.marquard#uct.ac.za Sat Jan 5 09:14:16 2008'
hr = list()
for line in txt:
    if line.startswith('From '):
        line = line.split()
        hr.append(line[5].split(':')[0])
print(line)
print(hr)
It gives me 8 (for print(line)) and [] (for print(hr)).
I just want to understand why this is not giving the output below:
['From', 'stephen.marquard#uct.ac.za', 'Sat', 'Jan', '5', '09:14:16', '2008']
['09']
If you can be certain about the structure of the string then:
txt = 'From stephen.marquard#uct.ac.za Sat Jan 5 09:14:16 2008'
print(txt.split()[5][:2])
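As for why the original loop fails: iterating over a string yields individual characters, not lines, so line.startswith('From ') is never true and hr stays empty. A minimal sketch of the intended loop (splitting the text into lines first):
txt = 'From stephen.marquard#uct.ac.za Sat Jan 5 09:14:16 2008'
hr = list()
for line in txt.splitlines():            # iterate over lines, not characters
    if line.startswith('From '):
        words = line.split()
        hr.append(words[5].split(':')[0])
print(hr)                                # ['09']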

Python print both the matching groups in regex

I want to find two fixed patterns in a log file. Here is what a line in the log file looks like:
passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362
From this log, I want to extract drmanhattan and 362, which is the number just before the line ends.
Here is what I have tried so far.
import sys
import re

with open("Xavier.txt") as f:
    for line in f:
        match1 = re.search(r'((\w+_\w+)|(\d+$))', line)
        if match1:
            print match1.groups()
However, every time I run this script, I always get drmanhattan as output and not drmanhattan 362.
Is it because of | sign?
How do I tell regex to catch this group and that group ?
I have already consulted this and this; however, they did not solve my problem.
import re

line = 'Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 362'
match1 = re.search(r'(\w+_\w+).*?(\d+$)', line)
if match1:
    print match1.groups()
    # ('drmanhattan_resources', '362')
If you have a test.txt file that contains the following lines:
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 362
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 363
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 361
you can do:
with open('test.txt', 'r') as fil:
    for line in fil:
        match1 = re.search(r'(\w+_\w+).*?(\d+)\s*$', line)
        if match1:
            print match1.groups()

# ('drmanhattan_resources', '362')
# ('drmanhattan_resources', '363')
# ('drmanhattan_resources', '361')
| means OR, so your regex catches (\w+_\w+) OR (\d+$).
Maybe you want something like this:
((\w+_\w+).*?(\d+$))
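A minimal sketch of that combined pattern in use (the sample line is copied from the question):
import re

line = 'passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362'
match = re.search(r'((\w+_\w+).*?(\d+$))', line)
if match:
    print match.group(2), match.group(3)
    # drmanhattan_resources 362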
With re.search you only get the first match, if any, and with | you tell re to look for either this or that pattern. As suggested in other answers, you could replace the | with .* to match "anything in between" those two patterns. Alternatively, you could use re.findall to get all matches:
>>> line = "passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362"
>>> re.findall(r'\w+_\w+|\d+$', line)
['drmanhattan_resources', '362']

Sorting the unique values from regex match in python

I am trying to parse a log file to extract email addresses.
I am able to match the email and print it with the help of regular expressions.
I noticed that there are a couple of duplicate emails in my log file. Can you help me figure out how I can remove the duplicates and print only the unique email addresses based on the matched patterns?
Here is the code I have written so far :
import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp = []
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            print pattern.group()
        else:
            print "nono"
Here is my example log file that I am trying to parse:
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => someuser#somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => me#somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => wo#somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => lol#somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h Completed
Also, I am curious if I can improve my program or the regex. Any suggestion would be very helpful.
Thanks in advance.
As danidee said (he was first), a set would do the trick.
Try this:
from __future__ import print_function
import re

with open('test.txt') as f:
    data = f.read().splitlines()

emails = set(re.sub(r'^.*\s+(\w+\#[^\s]*?)\s+.*', r'\1', line) for line in data if '#' in line)
print('\n'.join(emails)) if len(emails) else print('nono')
Output:
lol#somedomain.com
me#somedomain.com
someuser#somedomain.com
wo#somedomain.com
PS: you may want to do a proper email regexp check, because I used a very primitive one.
Some of the duplicates are due to a bug in your code where you do not reset temp when processing each line. A line that does not contain either -> or => and which is preceded by a line that does contain either of those strings will trigger the if temp: test, and output the email address from the previous line if there was one.
That can be fixed by jumping back to the start of the loop with continue when the line contains neither -> nor =>.
For the other genuine duplicates that occur because the same email address appears in multiple lines, you can filter those out with a set.
import sys
import re

addresses = set()
pattern = re.compile('^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?')

with open('/Users/me/Desktop/test.txt', 'r') as f:
    for line in f:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        else:
            # neither '=>' nor '->' present in the line
            continue
        match = pattern.match(temp[1])
        if match is not None:
            addresses.add(match.group())
        else:
            print "nono"

for address in sorted(addresses):
    print(address)
The addresses are stored in a set to remove duplicates. Then they are sorted and printed. Note also the use of the with statement to open the file within a context manager. This guarantees that the file will always be closed.
Also, as you will be applying the same regex pattern many times, it is worth compiling it ahead of time for better efficiency.
With a properly written regex pattern your code can be greatly simplified:
import re

addresses = set()
pattern = re.compile(r'[-=]> +(\w{1,}#\w{1,}\.\w{2,3})')

with open('test.txt', 'r') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            addresses.add(match.groups()[0])

for address in sorted(addresses):
    print(address)
You can use a set container to keep track of the results you have already seen; each time you match an email, check whether it is already in the set before printing it:
import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp = []
seen = set()
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            matched = pattern.group()
            if matched not in seen:
                seen.add(matched)   # remember it so duplicates are skipped
                print matched
        else:
            print "nono"

Don't want character-by-character printing

File content:
aditya#aditya-virtual-machine:~/urlcat$ cat http_resp
telnet 10.192.67.40 80
Trying 10.192.67.40...
Connected to 10.192.67.40.
Escape character is '^]'.
GET /same_domain HTTP/1.1
Host: www.google.com
HTTP/1.1 200 OK
Date: Tue, 09 Feb 2016 00:25:36 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Fri, 08 Jan 2016 20:10:52 GMT
ETag: "81-528d82f2644f1"
Accept-Ranges: bytes
Content-Length: 129
My code:
f1 = open('http_resp')
read = f1.read()
for line in read:
    # line=line.rstrip()
    line = line.strip()
    if not '.com' in line:
        continue
    print line
When the if not logic is removed, the output is something like this; it prints only a single character per line:
t
e
l
n
e
t
1
0
.
1
9
2
.
6
7
.
4
0
8
0
T
r
y
i
n
g
I don't want character-by-character printing.
The problem is that read() returns the entire file as a string. Thus, your loop
for line in read:
iterates through the characters, one at a time. The simplest change is this:
f1 = open('http_resp')
for line in f1.readlines():
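Putting that change together with the rest of the original loop, a minimal corrected sketch might look like this (iterating the file object directly also works and avoids reading everything into memory at once):
with open('http_resp') as f1:
    for line in f1:                  # each iteration yields one line
        line = line.strip()
        if '.com' not in line:
            continue
        print line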

How to turn a file into a list

I am currently trying to turn a text file with numbers into a list.
The text file is:
1.89
1.99
2.14
2.51
5.03
3.81
1.97
2.31
2.91
3.97
2.68
2.44
Right now I only know how to read the file. How can I make this into a list?
Afterwards, how can I assign the values in the list to names? For example:
jan = 1.89
feb = 1.99
etc
Code from comments:
inFile = open('program9.txt', 'r')
lineRead = inFile.readline()
while lineRead != '':
    words = lineRead.split()
    annualRainfall = float(words[0])
    print(format(annualRainfall, '.2f'))
    lineRead = inFile.readline()
inFile.close()
months = ('jan', 'feb', ...)
with open('filename') as f:
    my_list = [float(x) for x in f]
res = dict(zip(months, my_list))
This will however work ONLY if there are the same number of lines as months!
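A small sketch of one way to guard against a mismatch before building the dict (the filename is a placeholder, as in the answer above):
months = ('jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec')

with open('filename') as f:
    my_list = [float(x) for x in f]

if len(my_list) != len(months):
    raise ValueError('expected %d values, got %d' % (len(months), len(my_list)))

res = dict(zip(months, my_list))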
A file is already an iterable of lines, so you don't have to do anything to make it into an iterable of lines.
If you want to make it specifically into a list of lines, you can do the same thing as with any other iterable: call list on it:
with open(filename) as f:
    lines = list(f)
But if you want to convert this into a list of floats, it doesn't matter what kind of iterable you start with, so you might as well just use the file as-is:
with open(filename) as f:
    floats = [float(line) for line in f]
(Note that float ignores trailing whitespace, so it doesn't matter whether you use a method that strips off the newlines or leaves them in place.)
From a comment:
now i just need to find out how to assign the list to another list like jan = 1.89, feb = 1.99 and so on
If you know you have exactly 12 values (and it will be an error if you don't), you can write whichever of these looks nicer to you:
jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec = (float(line) for line in f)
jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec = map(float, f)
However, it's often a bad idea to have 12 separate variables like this. Without knowing how you're planning to use them, it's hard to say for sure (see Keep data out of your variable names for some relevant background on making the decision yourself), but it might be better to have, say, a single dictionary, using the month names as keys:
floats = dict(zip('jan feb mar apr may jun jul aug sep oct nov dec'.split(),
                  map(float, f)))
Or to just leave the values in a list, and use the month names as just symbolic names for indices into that list:
>>> jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec = range(12)
>>> print(floats[mar])
2.14
That might be even nicer with an IntEnum, or maybe a namedtuple. Again, it really depends on how you plan to use the data after importing them.
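For instance, a quick sketch of the namedtuple idea (the field names are just the month abbreviations used above; filename is a placeholder):
from collections import namedtuple

Months = namedtuple('Months', 'jan feb mar apr may jun jul aug sep oct nov dec')

with open(filename) as f:
    rainfall = Months(*map(float, f))   # requires exactly 12 lines in the file

print(rainfall.mar)                     # 2.14 for the sample data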
