I want to find two fixed patterns in a log file. Here is what a line in the log file looks like:
passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362
From this line, I want to extract drmanhattan and 362, the number just before the line ends.
Here is what I have tried so far:
import sys
import re

with open("Xavier.txt") as f:
    for line in f:
        match1 = re.search(r'((\w+_\w+)|(\d+$))', line)
        if match1:
            print match1.groups()
However, every time I run this script I always get drmanhattan as output, never drmanhattan 362.
Is it because of the | sign?
How do I tell the regex to capture this group and that group?
I have already consulted this and this link; however, they did not solve my problem.
line = 'Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37 84526 362'
match1 = re.search(r'(\w+_\w+).*?(\d+$)', line)
if match1:
    print match1.groups()
    # ('drmanhattan_resources', '362')
If you have a test.txt file that contains the following lines:
Passed dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23
04:19:37 84526 362 Passed
dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37
84526 363 Passed
dangerb.xavier64.423181.r000.drmanhattan_resources.log Aug 23 04:19:37
84526 361
you can do:
with open('test.txt', 'r') as fil:
    for line in fil:
        match1 = re.search(r'(\w+_\w+).*?(\d+)\s*$', line)
        if match1:
            print match1.groups()
# ('drmanhattan_resources', '362')
# ('drmanhattan_resources', '363')
# ('drmanhattan_resources', '361')
| means OR, so your regex matches (\w+_\w+) OR (\d+$).
Maybe you want something like this:
((\w+_\w+).*?(\d+$))
With re.search you only get the first match, if any, and with | you tell re to look for either one pattern or the other. As suggested in other answers, you could replace the | with .* to match "anything in between" those two patterns. Alternatively, you could use re.findall to get all matches:
>>> line = "passed dangerb.xavier64.423181.k000.drmanhattan_resources.log Aug 23 04:19:37 84526 362"
>>> re.findall(r'\w+_\w+|\d+$', line)
['drmanhattan_resources', '362']
My bot reads another bot's message, temporarily saves it, makes a few changes with .replace, and is then supposed to change the format of the entries it finds.
I have tried quite a few things but have not figured it out.
The text looks like this:
06 6 452872995438985XXX
09 22 160462182344032XXX
11 17 302885091519234XXX
And I want to get the following format:
6/06 452872995438985XXX
22/09 160462182344032XXX
17/11 302885091519234XXX
I have already tried the following things:
splitsprint = test.split(' ')  # test is in this case the string we use, e.g. the text shown above
for x in splitsprint:
    month, day, misc = x
    print(f"{day}/{month} {misc}")
---
newline = test.split('\n')
for line in newline:
    month, day, misc = line.split(' ')
    print(f"{day}/{month} {misc}")
But I always got a ValueError: too many values to unpack (expected 3) or something similar.
Does anyone here see my error?
It's because of the surrounding whitespace in the input, I'm guessing. Use strip:
s = '''
06 6 452872995438985XXX
09 22 160462182344032XXX
11 17 302885091519234XXX
'''
lines = s.strip().split('\n')
tokens = [l.split(' ') for l in lines]
final = [f'{day}/{month} {misc}' for month, day, misc in tokens]
for f in final:
    print(f)
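If the columns can be separated by runs of whitespace rather than single spaces, calling split() with no argument is more forgiving, since it collapses any whitespace run. A small variation on the same idea (the doubled spaces in the sample are invented for illustration):

```python
s = '''
06  6 452872995438985XXX
09 22  160462182344032XXX
11 17 302885091519234XXX
'''

for line in s.strip().split('\n'):
    # split() with no argument splits on any run of whitespace
    month, day, misc = line.split()
    print(f"{day}/{month} {misc}")
```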
I have the below output from a router stored in a file:
-#- --length-- -----date/time------ path
3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
4 1896 Sep 27 2019 14:22:08 +05:30 taas/NN41_R11_Golden_Config
5 1876 Nov 27 2017 20:07:50 +05:30 taas/nfast_default.cfg
I want to search for the substring 'Golden_Image' in the file and get the complete path. So here the required output would be this string:
taas/NN41_R11_Golden_Image
First attempt:
import re
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" in line:
            print(line)
Output:
3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
Second attempt:
import re
hand = open('outlog.out')
for line in hand:
    line = line.rstrip()
    x = re.findall('.*?Golden_Image.*?', line)
    if len(x) > 0:
        print x
Output:
['3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image']
Neither of these gives the required output. How can I fix this?
This is actually surprisingly fiddly to do if the path can contain spaces.
You need the maxsplit argument of split to isolate the path field.
with open("outlog.out") as f:
    for line in f:
        fields = line.split(None, 7)
        # the 8th field is the path, even if the path contains spaces
        if len(fields) == 8 and "Golden_Image" in fields[7]:
            print(fields[7])
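To see why maxsplit matters, here is a quick sketch with a made-up path containing a space (the name "taas/My Golden_Image" is invented for illustration). After at most 7 splits, the 8th field keeps the rest of the line intact:

```python
line = "3   97103164  Feb 7 2016 01:36:16 +05:30  taas/My Golden_Image"
# split(None, 7): at most 7 splits on runs of whitespace,
# so the 8th field (index 7) keeps any spaces inside the path
fields = line.split(None, 7)
print(fields[7])  # taas/My Golden_Image
```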
Do a split on the line and check whether the string "Golden_Image" exists in the split parts.
import re
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" not in line:
            continue
        print re.search(r'\S*Golden_Image\S*', line).group()
or
images = re.findall(r'\S*Golden_Image\S*', open("outlog.out").read())
Example:
>>> s = '''
-#- --length-- -----date/time------ path
3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image
4 1896 Sep 27 2019 14:22:08 +05:30 taas/NN41_R11_Golden_Config
5 1876 Nov 27 2017 20:07:50 +05:30 taas/nfast_default.cfg'''.splitlines()
>>> for line in s:
...     for i in line.split():
...         if "Golden_Image" in i:
...             print i
...
taas/NN41_R11_Golden_Image
>>>
Reading the full content at once and then searching is not efficient. Instead, the file can be read line by line; if a line matches the criteria, the path can be extracted with a RegEx, without doing further splits.
Use the following RegEx to get the path:
\s+(?=\S*$).*
Link: https://regex101.com/r/zuH0Zv/1
Here is working code:
import re

regex = r"\s+(?=\S*$).*"
test_str = "3 97103164 Feb 7 2016 01:36:16 +05:30 taas/NN41_R11_Golden_Image"
matches = re.search(regex, test_str)
print(matches.group().strip())
Following your code, if you just want the right output, it can be simpler:
with open("outlog.out") as f:
    for line in f:
        if "Golden_Image" in line:
            print(line.split()[-1])
The output is:
taas/NN41_R11_Golden_Image
PS: if you need more complex operations, you may want to try the re module, as in @Avinash Raj's answer.
On a switch, I run ntpq -nc rv and get this output:
associd=0 status=0715 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p3-RC10#1.2239-o Mon Mar 21 02:53:48 UTC 2016 (1)",
processor="x86_64", system="Linux/3.4.43.Ar-3052562.4155M", leap=00,
stratum=2, precision=-21, rootdelay=23.062, rootdisp=46.473,
refid=17.253.24.125,
reftime=dbf98d39.76cf93ad Mon, Dec 12 2016 20:55:21.464,
clock=dbf9943.026ea63c Mon, Dec 12 2016 21:28:03.009, peer=43497,
tc=10, mintc=3, offset=-0.114, frequency=27.326, sys_jitter=0.151,
clk_jitter=0.162, clk_wander=0.028
I am attempting to create a shell command with Python's subprocess module to extract only the value of "offset" (-0.114 in the example above).
I noticed that I can use the subprocess replacement module sh for this:
import sh
print(sh.grep(sh.ntpq("-nc rv"), 'offset'))
and I get:
mintc=3, offset=-0.114, frequency=27.326, sys_jitter=0.151,
which is incorrect, as I just want the value of 'offset': -0.114.
I'm not sure what I am doing wrong here, whether it's my grep call or my use of the sh module.
grep works line by line; it returns every line that matches any part of the input. But I think grep is overkill here. Once you have the shell output, just search for the value after the name:
items = sh.ntpq("-nc rv").split(',')
for pair in items:
    name, _, value = pair.partition('=')  # split only on the first '='
    # strip because we weren't careful with whitespace
    if name.strip() == 'offset':
        print(value.strip())
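If you would rather not depend on the sh module at all, the same extraction can be sketched with only the standard library. Here a hard-coded sample string stands in for the real ntpq call, which you would capture with subprocess.check_output:

```python
import re

# Sample text standing in for the output of `ntpq -nc rv`; in practice:
# output = subprocess.check_output(["ntpq", "-nc", "rv"], universal_newlines=True)
output = ("tc=10, mintc=3, offset=-0.114, frequency=27.326, "
          "sys_jitter=0.151, clk_jitter=0.162, clk_wander=0.028")

# grab the number (possibly negative) right after "offset="
m = re.search(r'\boffset=([-\d.]+)', output)
if m:
    print(m.group(1))  # -0.114
```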
I am trying to parse a log file to extract email addresses.
I am able to match the email and print it with the help of regular expressions.
I noticed that there are a couple of duplicate emails in my log file. Can you help me figure out how to remove the duplicates and print only the unique email addresses based on the matched patterns?
Here is the code I have written so far:
import sys
import re
file = open('/Users/me/Desktop/test.txt', 'r')
temp = []
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            print pattern.group()
        else:
            print "nono"
Here is my example log file that I am trying to parse:
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23
1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp
H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23
1Wuniq-mail-idSo-Fg -> someuser#somedomain.com R=mail T=remote_smtp
H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h => someuser#somedomain.com R=mail T=pop_mail_net
H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h => me#somedomain.com R=mail T=pop_mail_net
H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h => wo#somedomain.com R=mail T=pop_mail_net
H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h => lol#somedomain.com R=mail T=pop_mail_net
H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23
1Wuniq-mail-idSm-1h Completed
Also, I am curious whether I can improve my program or the regex. Any suggestions would be very helpful.
Thanks in advance.
As danidee said (he was first), a set will do the trick.
Try this:
from __future__ import print_function
import re
with open('test.txt') as f:
    data = f.read().splitlines()

emails = set(re.sub(r'^.*\s+(\w+\#[^\s]*?)\s+.*', r'\1', line) for line in data if '#' in line)
print('\n'.join(emails)) if len(emails) else print('nono')
Output:
lol#somedomain.com
me#somedomain.com
someuser#somedomain.com
wo#somedomain.com
PS: you may want to do a proper email RegExp check, because I used a very primitive one.
Some of the duplicates are due to a bug in your code where you do not reset temp when processing each line. A line that does not contain either -> or => and which is preceded by a line that does contain either of those strings will trigger the if temp: test, and output the email address from the previous line if there was one.
That can be fixed by jumping back to the start of the loop with continue when the line contains neither -> nor =>.
For the other genuine duplicates that occur because the same email address appears in multiple lines, you can filter those out with a set.
import re

addresses = set()
pattern = re.compile('^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?')
with open('/Users/me/Desktop/test.txt', 'r') as f:
    for line in f:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        else:
            # neither '=>' nor '->' present in the line
            continue
        match = pattern.match(temp[1])
        if match is not None:
            addresses.add(match.group())
        else:
            print("nono")

for address in sorted(addresses):
    print(address)
The addresses are stored in a set to remove duplicates. Then they are sorted and printed. Note also the use of the with statement to open the file within a context manager. This guarantees that the file will always be closed.
Also, as you will be applying the same regex pattern many times, it is worth compiling it ahead of time for better efficiency.
With a properly written regex pattern your code can be greatly simplified:
import re

addresses = set()
pattern = re.compile(r'[-=]> +(\w{1,}#\w{1,}\.\w{2,3})')
with open('test.txt', 'r') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            addresses.add(match.group(1))

for address in sorted(addresses):
    print(address)
You can use a set container to preserve unique results: each time you are about to print a matched email, check whether it has already been seen, and print it only if it hasn't:
import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp = []
seen = set()
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}#\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            matched = pattern.group()
            if matched not in seen:
                seen.add(matched)  # remember it so duplicates are skipped
                print matched
        else:
            print "nono"
I've got a column in one file that I'd like to replace with a column from another file. I'm trying to use sed to do this from within Python, but I'm not sure I'm doing it correctly. Maybe the code will make things clearer:
for line in infile1.readlines()[1:]:
    element = re.split("\t", line)
    IID.append(element[1])
    FID.append(element[0])

os.chdir(binary_dir)

for files in os.walk(binary_dir):
    for file in files:
        for name in file:
            if name.endswith(".fam"):
                infile2 = open(name, 'r+')

for line in infile2.readlines():
    parts = re.split(" ", line)
    Part1.append(parts[0])
    Part2.append(parts[1])

for i in range(len(Part2)):
    if Part2[i] in IID:
        regex = '"s/\.*' + Part2[i] + '/' + Part1[i] + ' ' + Part2[i] + '/"' + ' ' + phenotype
        print regex
        subprocess.call(["sed", "-i.orig", regex], shell=True)
This is what print regex produces. The system appears to hang during the sed call, as it sits there for quite some time without doing anything.
"s/\.*131006/201335658-01 131006/" /Users/user1/Desktop/phenotypes2
Thanks for your help, and let me know if you need further clarification!
You don't need sed if you have Python and the re module. Here is an example of how to use re to replace a given pattern in a string.
>>> import re
>>> line = "abc def ghi"
>>> new_line = re.sub("abc", "123", line)
>>> new_line
'123 def ghi'
>>>
Of course this is only one way to do it in Python. I feel that str.replace() would do the job for you too.
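For a fixed substring like this, str.replace() is indeed enough; a minimal sketch:

```python
line = "abc def ghi"
# replace a literal substring, no regex needed
new_line = line.replace("abc", "123")
print(new_line)  # 123 def ghi
```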
The first issue is shell=True that is used together with a list argument. Either drop shell=True or use a string argument (the complete shell command) instead:
from subprocess import check_call
check_call(["sed", "-i.orig", regex])
otherwise the arguments ('-i.orig' and regex) are passed to /bin/sh instead of sed.
The second issue is that you haven't provided any input files, so sed expects data from stdin; that is why it appears to hang.
If you want to make changes in files inplace, you could use fileinput module:
#!/usr/bin/env python
import fileinput
import re

files = ['/Users/user1/Desktop/phenotypes2']  # if it is None, it behaves like sed
for line in fileinput.input(files, backup='.orig', inplace=True):
    print re.sub(r'\.*131006', '201335658-01 13100', line),
fileinput.input() redirects stdout to the current file, i.e., print writes the changes into the file.
The trailing comma sets sys.stdout.softspace to avoid duplicate newlines (each line already ends with one).