getting a certain group from regex match - python

I have a .txt file with some text (copied from an EDIFACT file) and I want to match certain fields. I basically just want the date (match 1, group 0).
This is the regex that I have:
https://regex101.com/r/oSVlS8/6
But I can't implement it in my code; I only want group 0 of the match.
Here is my code:
import re

regex = r"^((?:INV)\+(?:[^+\n]*\+){4})\d{8}"
with open("test edifakt 1 bk v1.txt", "r") as f:
    result = re.findall(regex, f.read(), re.MULTILINE)
print(result)
And this is what I get as a result:
['INV+ED Format 1+Brustkrebs+19880117+E000000001+']
I actually want "20080702" instead
I tried things like print(result.group(0)), but that didn't work. I got:
AttributeError: 'list' object has no attribute 'group'
I also tried passing it as an argument, like result = re.findall(regex,f.read(),group(0),re.MULTILINE), but I get:
NameError: name 'group' is not defined
Can I really only select a certain group if I'm using re.search and it's a string?

You could change the capturing group to capture the digits instead.
Note that you can omit the non-capturing group around INV, (?:INV), and that using * as the quantifier in [^+\n]*\+ means the pattern could also match 4 consecutive plus signs (++++).
^INV\+(?:[^+\n]*\+){4}(\d{8})
^ Start of the string (or of each line, with re.MULTILINE)
INV\+ match INV+
(?: Non capturing group
[^+\n]*\+ Match 0+ times any char except a + or newline, then match a +
){4} Close group and repeat 4 times
(\d{8}) Capture group 1, match 8 digits
For example
regex = r"^INV\+(?:[^+\n]*\+){4}(\d{8})"
result = re.findall(regex, test_str, re.MULTILINE)
print(result)
Output
['20080702']
If you want to use the group method, you could use
matches = re.search(regex, test_str, re.MULTILINE)
if matches:
    print(matches.group(1))
Output
20080702
re.findall will return the value of the capture group.
re.search will return a match object which has a group method.
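To make the difference concrete, here is a minimal self-contained sketch. The contents of test_str are an assumption: only the beginning of the INV line is visible in the question's output, so the header line and the trailing text are made up for illustration.

```python
import re

# Hypothetical sample data: only the start of the INV line is known from the question.
test_str = (
    "UNH+1+hypothetical header\n"
    "INV+ED Format 1+Brustkrebs+19880117+E000000001+20080702+rest of segment\n"
)

regex = re.compile(r"^INV\+(?:[^+\n]*\+){4}(\d{8})", re.MULTILINE)

# findall returns only the capture group when the pattern has exactly one group.
dates = regex.findall(test_str)
print(dates)  # ['20080702']

# search returns a match object: group(0) is the whole match, group(1) the capture.
m = regex.search(test_str)
if m:
    print(m.group(0))  # INV+ED Format 1+Brustkrebs+19880117+E000000001+20080702
    print(m.group(1))  # 20080702
```

Note that group(0) is always the entire match, which is why the date has to live in its own capturing group.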

Try this regex
re.search(r'(?:INV)\+(?:[^+\n]*\+){4}(\d{8})', text).group(1)
Returns
'20080702'

Related

I want to extract gene boundaries(like 1..234, 234..456) from a file using regex in python but every time I use this code it returns me empty list

I want to extract gene boundaries (like 1..234, 234..456) from a file using regex in Python, but every time I use this code it returns an empty list.
Below is example file
Below is what I have so far:
import re
#with open('boundaries.txt','a') as wf:
with open('sequence.gb','r') as rf:
    for line in rf:
        x = re.findall(r"^\s+\w+\s+\d+\W\d+", line)
        print(x)
The pattern does not match because \W matches only a single non-word character after the first digits, whereas the boundaries contain two dots (as in 1..234).
You can repeat it 1 or more times with \W+.
As you want to have a single match from the start of the string, you can also use re.match without the anchor ^
^\s+\w+\s+\d+\W+\d+
import re
s=" gene 1..3256"
pattern = r"\s+\w+\s+\d+\W+\d+"
m = re.match(pattern, s)
if m:
    print(m.group())
Output
gene 1..3256
Maybe you used the wrong regex. Try the code below.
for line in rf:
    x = re.findall(r"g+.*\s*\d+", line)
    print(x)
You can also use regex101 to test your regex pattern online.
A more suitable pattern would be: ^\s*gene\s*(\d+\.\.\d+)
Explanation:
^ - match beginning of a line
\s* - match zero or more whitespaces
gene - match gene literally
(...) - capturing group
\d+ - match one or more digits
\.\. - match .. literally
Then, it's enough to get match from first capturing group to get gene boundaries.
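As a rough sketch of how that pattern could be applied line by line: the sample lines below are an assumption modelled on a GenBank feature table, since the example file from the question is not shown.

```python
import re

pattern = re.compile(r"^\s*gene\s*(\d+\.\.\d+)")

# Hypothetical lines standing in for the contents of sequence.gb.
sample_lines = [
    "     gene            1..3256",
    "     CDS             1..3256",
    "     gene            3300..4100",
]

boundaries = []
for line in sample_lines:
    m = pattern.match(line)
    if m:
        # group(1) holds only the "start..end" part.
        boundaries.append(m.group(1))

print(boundaries)  # ['1..3256', '3300..4100']
```

When reading from the real file, the loop over sample_lines would be replaced by a loop over the open file object.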

python regex return non-capturing group

I want to generate a username from an email with :
firstname's first letter
lastname's first 7 letters
eg :
getUsername("my-firstname.my-lastname#email.com")
mmylastn
Here is getUsername's code :
def getUsername(email) :
    return re.match(r"(.){1}[a-z]+.([a-z]{7})",email.replace('-','')).group()
email.replace('-','') to get rid of the - symbol
regex that captures the 2 groups I described above
If I do .group(1,2) I can see the captured groups are m and mylastn, so it's all good.
But using .group() doesn't just return the capturing groups but also everything between them: myfirstname.mylastn
Can someone explain this behavior?
First of all, a . in a pattern is a metacharacter that matches any char excluding line break chars. You need to escape the . in the regex pattern
Also, {1} limiting quantifier is always redundant, you may safely remove it from any regex you have.
Next, if you need to get a mmylastn string as a result, you cannot use match.group() because .group() fetches the overall match value, not the concatenated capturing group values.
So, in your case,
Check if there is a match first, since trying to access None.groups() will throw an exception
Then join the match.groups()
You can use
import re
def getUsername(email) :
    m = re.match(r"(.)[a-z]+\.([a-z]{7})",email.replace('-',''))
    if m:
        return "".join(m.groups())
    return email
print(getUsername("my-firstname.my-lastname#email.com"))

Regular expression for pandoc-markdown citations

I'm trying to search and replace citations from pandoc-markdown.
They have the following syntax:
[prenote #autorkey, postnote]
Or for more than one Author
[prenote1 #authorekey1, postnote1; prenote2 #authorkey2, postnote2]
The pre-notes, the author-keys and the post-notes should each be in their own capture group.
For only one author in a citation I used this regex:
\[((.*) )?#(.*?)(, (.*))?\]
But I can't figure out how to match a citation with multiple authors.
Ideally it would be possible to match citations with one or more author keys.
The pre-note and the post-note should be optional.
Is this possible?
We need more context with code (full sample code) to be able to answer fully, so I can only answer in the same general way in which you asked the question.
I do not believe you can do it in one operation with one regular expression.
So the overall technique I would use is:
1. First match the entire citation (with one or more authors) using a simple regex with only one group, namely for everything between [ and ].
2. Then, when a match is found, split what is in that match (i.e. everything between the square brackets) by ; to get a list of "prenote #authorkey, postnote" strings.
3. Do the wanted replacements on each element in that resulting list of single author strings.
4. Stitch together the final citation by joining the resulting list with semicolons again and adding [ and ] around it.
5. Put that final citation in the original instead of the matched string.
You can put steps 2 to 4 in a function f(match_object), and then use re.sub(pattern, f, string) to do the replacement. It will call function f for each match it finds, and replace that match with the return value of f.
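A minimal sketch of that approach with the standard re module follows. The names CITATION, SINGLE and rewrite_citation are hypothetical, the # marker is used as written in the question, and upper-casing the author key is only a placeholder for whatever replacement is actually wanted.

```python
import re

# Step 1: one group for everything between [ and ].
CITATION = re.compile(r"\[([^\]\n]*)\]")
# One "prenote #authorkey, postnote" entry; prenote and postnote are optional.
SINGLE = re.compile(r"^(?:(.*) )?#([^,]*)(?:, (.*))?$")

def rewrite_citation(match):
    """Steps 2 to 4: split, rewrite each entry, stitch back together."""
    rewritten = []
    for part in match.group(1).split(";"):
        m = SINGLE.match(part.strip())
        if not m:
            # Not a citation after all: leave the bracketed text untouched.
            return match.group(0)
        prenote, key, postnote = m.groups()
        new = "#" + key.upper()  # placeholder replacement
        if prenote:
            new = prenote + " " + new
        if postnote:
            new = new + ", " + postnote
        rewritten.append(new)
    return "[" + "; ".join(rewritten) + "]"

text = "See [prenote1 #key1, postnote1; prenote2 #key2, postnote2] and [#solo]."
# Step 5: re.sub calls rewrite_citation for every match and inserts its return value.
result = CITATION.sub(rewrite_citation, text)
print(result)
```

Because each entry is parsed separately, the number of authors per citation does not matter, which is what a single fixed regex struggles with.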
You might make use of the PyPI regex module to get the 3 capturing groups.
(?:\G(?!^)|\[(?=[^][\r\n]*\]))[^\S\r\n]*(.*?) #(.*?), ([^][,\r\n]*)[\];]
Explanation
(?: Non capture group
\G(?!^) Assert the position at the end of the previous match, not at the start
| Or
\[(?=[^][\r\n]*\]) Match [ and assert that there is a closing ]
) Close non capture group
[^\S\r\n]* Match 0+ occurrences of a whitespace char except a newline
(.*?) Capture group 1, match any char except a newline, as few as possible
# Match literally
(.*?) Capture group 2, match any char except a newline, as few as possible
, Match literally
([^][,\r\n]*) Capture group 3, match any char except ] [ , or a newline
[\];] Match either ] or ;
Example code using regex.finditer
import regex
pattern = r"(?:\G(?!^)|\[(?=[^][\r\n]*\]))[^\S\r\n]*(.*?) #(.*?), ([^][,\r\n]*)[\];]"
test_str = ("[prenote #autorkey, postnote]\n"
            "[prenote1 #authorekey1, postnote1; prenote2 #authorkey2, postnote2]\n")
matches = regex.finditer(pattern, test_str)
for matchNum, match in enumerate(matches, start=1):
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print (match.group(groupNum))
Output
prenote
autorkey
postnote
prenote1
authorekey1
postnote1
prenote2
authorkey2
postnote2

Python: Non capturing group is not working in Regex

I'm using a non-capturing group in a regex, i.e. (?:.*), but it's not working.
I can still see it in the result. How can I ignore it, i.e. not capture it in the result?
Code:
import re
text = '12:37:25.790 08/05/20 Something P LR 0.156462 sccm Pt   25.341343 psig something-else'
pattern = [r'(?P<time>\d\d:\d\d:\d\d.\d\d\d)\s{1}',
           r'(?P<date>\d\d/\d\d/\d\d)\s',
           r'(?P<pr>(?:.*)Pt\s{3}\d*[.]?\d*\s[a-z]+)'
           ]
result = re.search(r''.join(pattern), text)
Output:
>>> result.group('pr')
'Something P LR 0.156462 sccm Pt   25.341343 psig'
Expected output:
'Pt   25.341343 psig'
More info:
>>> result.groups()
('12:37:25.790', '08/05/20', 'Something P LR 0.156462 sccm Pt   25.341343 psig')
The quantifier is inside the named group; you have to place it outside and possibly make it non-greedy too.
The updated pattern could look like:
(?P<time>\d\d:\d\d:\d\d.\d\d\d)\s{1}(?P<date>\d\d/\d\d/\d\d)\s.*?(?P<pr>Pt\s{3}\d*[.]?\d*\s[a-z]+)
Note that with the current pattern, the number is optional, as all of its quantifiers are optional. You can omit {1} as well.
If the number after Pt can not be empty, you can update the pattern using \d+(?:\.\d+)? matching at least a single digit:
(?P<time>\d\d:\d\d:\d\d.\d{3})\s(?P<date>\d\d/\d\d/\d\d)\s.*?(?P<pr>Pt\s{3}\d+(?:\.\d+)?\s[a-z]+)
(?P<time> Group time
\d\d:\d\d:\d\d.\d{3} Match a time like format
)\s Close group and match a whitespace char
(?P<date> Group date
\d\d/\d\d/\d\d Match a date like pattern
)\s Close group and match a whitespace char
.*? Match any char except a newline, as few as possible
(?P<pr> Group pr
Pt\s{3} Match Pt and 3 whitespace chars
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
\s[a-z]+ Match a whitespace char and 1+ times a char a-z
) Close group
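A short sketch of the updated pattern in use. It assumes the sample text has exactly three spaces between Pt and the number, which is what \s{3} requires.

```python
import re

# Assumption: three spaces follow "Pt", as demanded by \s{3} in the pattern.
text = ("12:37:25.790 08/05/20 Something P LR 0.156462 sccm "
        "Pt   25.341343 psig something-else")

pattern = re.compile(
    r"(?P<time>\d\d:\d\d:\d\d.\d{3})\s"
    r"(?P<date>\d\d/\d\d/\d\d)\s"
    r".*?"  # skipped text, outside every group
    r"(?P<pr>Pt\s{3}\d+(?:\.\d+)?\s[a-z]+)"
)

result = pattern.search(text)
if result:
    print(result.group("time"))  # 12:37:25.790
    print(result.group("date"))  # 08/05/20
    print(result.group("pr"))    # Pt   25.341343 psig
```

The text before Pt is still consumed by the match, but because .*? sits outside the named group it no longer shows up in result.group("pr").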
Remove the non-capturing group from your named group. Using a non-capturing group means that no new group will be created in the match, not that that part of the string will be removed from any including group.
import re
text = 'Something P LR 0.156462 sccm Pt   25.341343 psig something-else'
pattern = r'(?:.*)(?P<pr>Pt\s{3}\d*[.]?\d*\s[a-z]+)'
result = re.search(pattern, text)
print(result.group('pr'))
Output:
Pt   25.341343 psig
Note that the specific non-capturing group you used can be excluded completely, as it basically means that you want your regex to be preceded by anything, and that's what search will do anyway.
I think there is a confusion regarding the meaning of "non-capturing" here: It does not mean that the result omits this part, but that no match group is created in the result.
Example where the same regex is executed with a capturing, and a non-capturing group:
>>> import re
>>> match = re.search(r'(?P<grp>foo(.*))', 'foobar')
>>> match.groups()
('foobar', 'bar')
>>> match = re.search(r'(?P<grp>foo(?:.*))', 'foobar')
>>> match.groups()
('foobar',)
Note that match.group(0) is the same in both cases (group 0 contains the matching part of the string in full).

How to retrieve second value from search object?

I am unable to obtain the second match from a match object.
EDIT: found a fix
ptrn_buy_in = re.compile("(?<=[$€£])([0-9]+[.,][0-9]+)")
buy_in = re.findall(ptrn_buy_in, lines[0])
if buy_in[1] and buy_in[2]:
    parsed_hand["metadata"]["buy_in"] = float(buy_in[1])
    parsed_hand["metadata"]["rake"] = float(buy_in[2])
My string is: $14.69+$0.31
I have tried to see if the match object actually holds more values within the same group i.e. .group(0)[0] and [1]. This actually gave me the second digit of the number so not at all what I was expecting.
ptrn_buy_in = re.compile("(?<=[$€£])([0-9]+.[0-9]+)")
buy_in = re.search(ptrn_buy_in, lines[0])
if buy_in.group(0) and buy_in.group(1):
    parsed_hand["metadata"]["buy_in"] = float(buy_in.group(0))
    parsed_hand["metadata"]["rake"] = float(buy_in.group(1))
I am expecting to get 14.69 in .group(0) and 0.31 from .group(1) however I am getting 14.69 twice. Does anyone have any ideas?
Kind regards
There is only 1 capturing group ([0-9]+.[0-9]+) which will match 1+ digits, 1 times any character and again 1+ digits.
re.search returns a match object, where .group(0) returns the entire match and .group(1) the first capturing group. Since the group spans the whole match here, you get 14.69 twice.
You have to escape the dot to match it literally. But as you use a positive lookbehind, you could omit the group and get the match only:
(?<=[$€£])[0-9]+\.[0-9]+
import re
ptrn_buy_in = re.compile(r"(?<=[$€£])[0-9]+\.[0-9]+")
print(re.findall(ptrn_buy_in, "$14.69+$0.31"))
Result
['14.69', '0.31']
Or use a match with the capturing group:
[$€£]([0-9]+\.[0-9]+)
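A small sketch of the capturing-group variant with re.findall, using the string from the question. The variable names are only illustrative.

```python
import re

line = "$14.69+$0.31"

# With one capturing group, findall returns just the captured amounts.
amounts = re.findall(r"[$€£]([0-9]+\.[0-9]+)", line)
print(amounts)  # ['14.69', '0.31']

# Guard against lines that do not contain both values.
if len(amounts) >= 2:
    buy_in, rake = float(amounts[0]), float(amounts[1])
    print(buy_in, rake)  # 14.69 0.31
```

The second amount is a second match, not a second group, which is why re.search alone can never return it.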
