Regex help to match groups - python

I am trying to write a regex for matching a text file that has multiple lines such as :
* 964 0050.56aa.3480 dynamic 200 F F Veth1379
* 930 0025.b52a.dd7e static 0 F F Veth1469
My intention is to match the "0050.56aa.3480 " and "Veth1379" and put them in group(1) & group(2) for using later on.
The regex I wrote is :
\*\s*\d{1,}\s*(\d{1,}\.(?:[a-z][a-z]*[0-9]+[a-z0-9]*)\.\d{1,})\s*(?:[a-z][a-z]+)\s*\d{1,}\s*.\s*.\s*((?:[a-z][a-z]*[0-9]+[a-z0-9]*))
But it does not seem to be working when I test at:
http://www.pythonregex.com/
Could someone point out any obvious error I am making here?
Thanks,
~Newbie

Try this:
^\* [0-9]{3} +([0-9]{4}\.[0-9a-z]{4}\.[0-9a-z]{4}).*(Veth[0-9]{4})$
The first part is in capture group one, the "Veth" code in capture group two.
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a list of online testers in the bottom section.

I don't think you need a regex for this:
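with open('myfile', 'r') as f:
    for line in f:
        fields = line.split()
        # For the sample lines, fields[2] is the MAC address and fields[7] is the interface.
        if len(fields) >= 8:
            print("\n" + fields[2] + "\n" + fields[7])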

A very strict version would look something like this:
^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$
Here I assume that:
the columns will be pretty much the same,
your first match group contains a group of decimal digits and two groups of lower-case hex digits,
and the last word can be anything.
A few notes:
\d+ is equivalent to \d{1,} or [0-9]{1,}, but reads better (imo)
use \. to match a literal ., as . would simply match anything
[a-z]{2} is equivalent to [a-z][a-z], but reads better (my opinion, again)
however, you might want to use \w instead to match a word character
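As a quick check of the strict pattern above, here is a minimal sketch that runs it over the two sample lines from the question. The names `sample` and `row_re` are made up for illustration, and `re.MULTILINE` is an assumption so that ^ and $ anchor at every line rather than only at the ends of the whole text.

```python
import re

# The two sample lines from the question, as one multi-line string.
sample = """* 964 0050.56aa.3480 dynamic 200 F F Veth1379
* 930 0025.b52a.dd7e static 0 F F Veth1469"""

# The strict pattern from the answer; MULTILINE makes ^ and $ match per line.
row_re = re.compile(
    r"^\*\s+\d{3}\s+(\d{4}(?:\.[0-9a-f]{4}){2})\s+\w+\s+\d+\s+\w\s+\w\s+([0-9A-Za-z]+)$",
    re.MULTILINE,
)

# findall returns one (group 1, group 2) tuple per matching line.
pairs = row_re.findall(sample)
for mac, interface in pairs:
    print(mac, interface)
```

Using findall (or finditer) on the whole text avoids looping over lines by hand; reading line by line and calling search on each line would work just as well.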

This will do it:
import re

reobj = re.compile(r"^.*?([\w]{4}\.[\w]{4}\.[\w]{4}).*?([\w]+)$", re.IGNORECASE | re.MULTILINE)
match = reobj.search(subject)  # subject holds the text to search
if match:
    group1 = match.group(1)
    group2 = match.group(2)
else:
    group1 = group2 = ""

Related

Regex to fix (all the matches or none) at the end to one

I'm trying to normalize the trailing .'s of a string to exactly one. For example,
line = "python...is...fun..."
I have the regex \.*$ in Ruby, whose match should be replaced by a single ., but it doesn't seem to work as expected. I've searched for similar posts, and the closest I got is an answer in Python, which suggests the following:
>>> text1 = 'python...is...fun...'
>>> new_text = re.sub(r"\.+$", ".", text1)
>>> new_text
'python...is...fun.'
But it fails if there is no . at the end. So I tried \b\.*$, but this fails on the third test, which has some ?'s at the end.
My question is: why doesn't \.*$ match all the .'s (despite being greedy), and how do I solve the problem correctly?
Expected output:
python...is...fun.
python...is...fun.
python...is...fun??.
You might use an alternation that either matches 2 or more dots, or asserts that what is directly to the left is not a dot (the lookbehind can be extended with other characters, for example ! or ?, if those should not get a dot appended).
In the replacement use a single dot.
(?:\.{2,}|(?<!\.))$
Explanation
(?: Non capture group for the alternation
\.{2,} Match 2 or more dots
| Or
(?<!\.) Get the position where directly to the left is not a . (which you can extend with other characters as desired)
) Close non capture group
$ End of string (Or use \Z if there can be no newline following)
For example
import re
strings = [
    "python...is...fun...",
    "python...is...fun",
    "python...is...fun??"
]
for s in strings:
    new_text = re.sub(r"(?:\.{2,}|(?<!\.))$", ".", s)
    print(new_text)
Output
python...is...fun.
python...is...fun.
python...is...fun??.
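On why \.*$ misbehaves, here is a small sketch of what happens in Python's re module. The Ruby demo from the question is not reproduced here, and it is only an assumption that Ruby behaves the same way. Because * allows zero repetitions, the pattern matches the run of trailing dots and then matches once more as an empty string at the very end, so a global replace has two places to put a dot.

```python
import re

text = "python...is...fun..."

# \.*$ can match the empty string, so it matches twice at the end:
# once for the run of dots, and once more for the empty string after it.
matches = re.findall(r"\.*$", text)
print(matches)

# \.+$ requires at least one dot, so there is exactly one match,
# but it finds nothing when the string has no trailing dot.
plus_matches = re.findall(r"\.+$", text)
print(plus_matches)
```

On Python 3.7 and later, re.sub replaces empty matches that are adjacent to a previous non-empty match, so substituting \.*$ with a single dot leaves two dots at the end. The alternation above avoids this because its empty branch is refused right after a dot.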
If an empty string should not be replaced by a dot, you can use a positive lookbehind.
(?:\.{2,}|(?<=[^\s.]))$
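To make the difference between the two patterns concrete, the following sketch runs both over an empty string and two of the sample strings. The variable names are illustrative only.

```python
import re

append_always = r"(?:\.{2,}|(?<!\.))$"        # negative lookbehind version
append_nonempty = r"(?:\.{2,}|(?<=[^\s.]))$"  # positive lookbehind version

results = {}
for s in ["", "python...is...fun", "python...is...fun..."]:
    results[s] = (re.sub(append_always, ".", s), re.sub(append_nonempty, ".", s))
    print(repr(s), "->", results[s])
```

The negative lookbehind succeeds at the start of an empty string because there is nothing to the left at all, so a dot is added. The positive lookbehind needs an actual non-whitespace, non-dot character to the left, so the empty string is left alone.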

insert space between regex match

I want to split run-together typos in my string by locating them with a regex and inserting a space character inside each match.
I tried the solution to a similar question (Insert space between characters regex), which is to use '\1 \2' as the replacement string in re.sub, but it did not work for me.
import re
corpus = '''
This is my corpus1a.I am looking to convert it into a 2corpus 2b.
'''
clean = re.compile('\.[^(\d,\s)]')
corpus = re.sub(clean,' ', corpus)
clean2 = re.compile('\d+[^(\d,\s,\.)]')
corpus = re.sub(clean2,'\1 \2', corpus)
EXPECTED OUTPUT:
This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
You need to put the capture group parentheses around the patterns that match each string that you want to copy to the result.
There's also no need to use + after \d. You only need to match the last digit of the number.
clean = re.compile(r'(\d)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
I'm not sure about other possible inputs, but we might be able to add spaces using an expression similar to:
(\d+)([a-z]+)\b
After that, we would replace any run of two or more spaces with a single space. It might work, though I'm not sure:
import re
print(re.sub(r"\s{2,}", " ", re.sub(r"(\d+)([a-z]+)\b", " \\1 \\2", "This is my corpus1a.I am looking to convert it into a 2corpus 2b")))
Capture groups, marked by parentheses ( and ), should be around the patterns you want to reuse in the replacement.
So this should work for you:
clean = re.compile(r'(\d+)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
The regex (\d+)([^\d,\s]) reads: match 1 or more digits (\d+) as group 1 (first set of parentheses), then match one character that is not a digit, comma, or whitespace as group 2. The replacement must be a raw string (r'\1 \2'); otherwise Python reads \1 and \2 as control characters rather than group references.
The reason yours didn't work is that you did not have parentheses surrounding the patterns you want to reuse.
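Putting the pieces together, here is one possible sketch that reproduces the expected output from the question for this particular corpus. The three-step split is my own assumption about what counts as a typo (letter followed by digit, digit followed by letter, period followed by a non-space); it is not taken from any of the answers.

```python
import re

corpus = "This is my corpus1a.I am looking to convert it into a 2corpus 2b."

# A non-raw '\1' is the control character U+0001, not a group reference.
print('\1' == '\x01')

# Space between a letter and a following digit, then between a digit and a following letter.
step = re.sub(r'([A-Za-z])(\d)', r'\1 \2', corpus)
step = re.sub(r'(\d)([A-Za-z])', r'\1 \2', step)

# Space after a period that is directly followed by a non-space character.
result = re.sub(r'\.(\S)', r'. \1', step)
print(result)
```

Each substitution keeps both captured characters and only inserts the space between them, which is exactly what the r'\1 \2' replacement is for. Other inputs (decimal numbers such as 3.14, for instance) would need extra care.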

python regExp search with lookarounds

In my test program I get an input that goes like
str = "TestID277RStep01CtrAx-mn00112345"
Here, I want to use regExp to form groups that return me the following
str = "Test(ID277)(R)(Step01)(CtrAx-mn001)12345"
My goal is to end up with 4 vars
var1 = "ID277"
var2 = "R"
var3 = "Step01"
var4 = "CtrAx-mn001"
I have so far tried
regx = ".*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(Ctr(?=[A-Z][a-z]-/d{3}))?.*"
re_testInp = re.compile ( regx, re.IGNORECASE )
srch = re_testInp.search( r'^' + str )
print srch.groups()
I seem to be getting the first 3 groups right but unable to get the last one.
Almost close to pulling all my hair out with this one. Any help will be much appreciated.
It works fine for me with Python 3.6.0 and the following pattern:
.*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(.*\-(?=[A-Za-z][a-z]\d{3})[A-Za-z][a-z]\d{3})?.*
I only changed the last capturing group. Here is what was wrong, in my opinion, with the pattern you included:
.*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(Ctr(?=[A-Z][a-z]/d{3}))?.*
Notice that the last capture group, (Ctr(?=[A-Z][a-z]/d{3})), will not find a match because:
You attempt to match a literal 'Ctr' but do not account for the literal '-'. I do not know exactly what text you are trying to match there, so I generalized it to: .*-
You wrote /d{3} instead of \d{3}.
In the test string you included, '...ReqAx-mn...', the m is lowercase. You should change the pattern to (Ctr(?=[A-Za-z][a-z]\d{3})) if you want to support lowercase as well.
You do not use the lookahead assertion properly. As stated in: https://docs.python.org/3/library/re.html
(?=...)
Matches if ... matches next, but doesn’t consume any of the string.
This is called a lookahead assertion. For example, Isaac (?=Asimov)
will match 'Isaac ' only if it’s followed by 'Asimov'.
Meaning you should change the capturing group to: (.*-(?=[A-Za-z][a-z]\d{3})[A-Za-z][a-z]\d{3})
In: (Step(?=\d)\d+) I assume you thought the first digit would be captured in the lookahead assertion, but both digits are captured by the following \d+
Ben.
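To illustrate the point about lookaheads not consuming text, here is a minimal sketch using the string from the question. The simplified pattern at the end is only an assumption about the field shapes (Ctr, then letters, a hyphen, two letters, three digits); it is not the pattern from the answer above and drops the lookaheads entirely, since plain matching is enough here.

```python
import re

s = "TestID277RStep01CtrAx-mn00112345"

# A lookahead checks what follows without consuming it ...
only_ctr = re.search(r"Ctr(?=Ax)", s).group(0)
print(only_ctr)
# ... so the text has to be matched again outside the lookahead to be captured.
ctr_and_ax = re.search(r"Ctr(?=Ax)Ax", s).group(0)
print(ctr_and_ax)

# Simplified pattern without lookaheads (assumed field shapes, see the lead-in).
fields_re = re.compile(
    r"Test(ID\d+)([RP]?)(Step\d+)?(Ctr[A-Za-z]+-[a-z]{2}\d{3})?",
    re.IGNORECASE,
)
var1, var2, var3, var4 = fields_re.search(s).groups()
print(var1, var2, var3, var4)
```

If the real inputs vary more than this one sample, the character classes in the last group would need to be adjusted accordingly.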

How to pull out language via regex

I have the following two string:
s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'
I want to pull out the first language in each string. Here is what I have so far:
re.findall(r'\s\((.+)', s1)
['English)']
How would I improve this to work on both of the above?
You could use this, which finds only the first language and is just a small tweak to your existing code:
f = re.findall(r'\((\w+)', s1)
e = re.findall(r'\((\w+)', s2)
if f:
    print(f)
if e:
    print(e)
Output:
['English']
['English']
If you only want the first language, then you should be using search instead, like so:
f = re.search(r'\((\w+)', s1)
e = re.search(r'\((\w+)', s2)
if f:
    print(f.group(1))
if e:
    print(e.group(1))
This prints a string rather than a list, since it only finds one thing.
Widen the search to start the phrase with a parenthesis or comma+space, and end with a parenthesis or comma+space:
>>> re.findall(r'\s(?:\(|, )(.+)(?:\)|, )', s2)
['English, French']
The ?: after a parenthesis indicates a non-capturing group.
You can then split the result on ', ' and grab whichever language you're interested in with indexing.
Since the strings you're searching are actually pretty tidy, you can also do this without regex:
>>> s1.split('(')[1].split(')')[0].split(', ')[0]
'English'
>>> s2.split('(')[1].split(')')[0].split(', ')[0]
'English'
You can just use this simple modification of your regular expression:
\s\(([^,\n\)]+)
You're looking for the text after the first left parenthesis and before the first comma or closing parenthesis (s1 has no comma). So, a regex that would match this is:
\(([^,)]*)
(Your answer will be in group 1)
Finally, I'd like to point you to https://www.debuggex.com/, which will help you easily visualize your regex questions.
Assuming languages are always at the end, surrounded by brackets and listed with ,:
(?<=\()\w+(?=(?:, \w+)*\)$)
The idea is:
(?<=\() - the string should be preceded by an opening bracket, (
\w+ - the language itself is a sequence of word characters
(?=(?:, \w+)*\)$) - after it, there can be zero or more other languages, each separated by a comma and a space, followed by the closing bracket ) at the end of the string
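As a side-by-side check, this sketch runs a capture-group pattern and the lookaround pattern above against both strings from the question. The names `capture_re` and `lookaround_re` are illustrative, and the capture pattern is a slight variant of the ones suggested in the answers.

```python
import re

s1 = 'Audio: Dolby Digital 5.1 (English)'
s2 = 'Audio: Stereo (English, French)'

# Capture everything after '(' up to the first comma or ')'.
capture_re = re.compile(r'\(([^,)]+)')
# Lookaround version: the match itself (group 0) is the language.
lookaround_re = re.compile(r'(?<=\()\w+(?=(?:, \w+)*\)$)')

first_languages = []
for s in (s1, s2):
    by_capture = capture_re.search(s).group(1)
    by_lookaround = lookaround_re.search(s).group(0)
    first_languages.append((by_capture, by_lookaround))
    print(by_capture, by_lookaround)
```

The capture version is more forgiving (it does not require the bracket to close at the end of the string), while the lookaround version validates the whole language list before matching.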

Python Regex behaviour with Square Brackets []

This the text file abc.txt
abc.txt
aa:s0:education.gov.in
bb:s1:defence.gov.in
cc:s2:finance.gov.in
I'm trying to parse this file by tokenizing (correct me if this is the incorrect term :) ) at every ":" using the following regular expression.
parser.py
import re,sys,os,subprocess
path = r"C:\abc.txt"
site_list = open(path,'r')
for line in site_list:
    site_line = re.search(r'(\w)*:(\w)*:([\w\W]*\.[\W\w]*\.[\W\w]*)',line)
    print('Regex found that site_line.group(2) = '+str(site_line.group(2)))
Why is the output
Regex found that site_line.group(2) = 0
Regex found that site_line.group(2) = 1
Regex found that site_line.group(2) = 2
Can someone please help me understand why it matches only the last character of the second group? I think it's matching 0 from s0, 1 from s1, and 2 from s2.
But why?
Let's show a simplified example:
>>> re.search(r'(.)*', 'asdf').group(1)
'f'
>>> re.search(r'(.*)', 'asdf').group(1)
'asdf'
If you have a repetition operator around a capturing group, the group stores the last repetition. Putting the group around the repetition operator does what you want.
If you were expecting to see data from the third group, that would be group(3). group(0) is the whole match, and group(1), group(2), etc. count through the actual parenthesized capturing groups.
That said, as the comments suggest, regexes are overkill for this.
>>> 'aa:s0:education.gov.in'.split(':')
['aa', 's0', 'education.gov.in']
Also, group(0) is the entire match by default. From the documentation:
If a groupN argument is zero, the corresponding return value is the
entire matching string.
So you should skip it, and check group(3) if you want the last one.
Also, you should compile the regexp before the for loop; it can slightly improve the performance of your parser.
And you can replace (\w)* with (\w*) if you want to match all the characters between the colons.
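The suggestions above can be combined into a short sketch. The file contents are inlined as a list so the example is self-contained, and the (\S+) for the last field is my own simplification of the original third group, not something taken from the answers.

```python
import re

lines = [
    "aa:s0:education.gov.in",
    "bb:s1:defence.gov.in",
    "cc:s2:finance.gov.in",
]

# Compiled once, outside the loop. The * is inside each group, so every
# group captures the whole field instead of only its last character.
line_re = re.compile(r'(\w*):(\w*):(\S+)')

parsed = []
for line in lines:
    m = line_re.search(line)
    if m:  # search returns None when a line does not match
        parsed.append(m.groups())
        print(m.group(2), m.group(3))
```

Checking the match object before calling group also avoids an AttributeError on blank or malformed lines.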
