How to retrieve second value from search object? - python

I am unable to obtain the second match from a match object.
EDIT: found a fix
ptrn_buy_in = re.compile("(?<=[$€£])([0-9]+[.,][0-9]+)")
buy_in = re.findall(ptrn_buy_in, lines[0])
if buy_in[1] and buy_in[2]:
    parsed_hand["metadata"]["buy_in"] = float(buy_in[1])
    parsed_hand["metadata"]["rake"] = float(buy_in[2])
My string is: $14.69+$0.31
I have tried to see if the match object actually holds more values within the same group i.e. .group(0)[0] and [1]. This actually gave me the second digit of the number so not at all what I was expecting.
ptrn_buy_in = re.compile("(?<=[$€£])([0-9]+.[0-9]+)")
buy_in = re.search(ptrn_buy_in, lines[0])
if buy_in.group(0) and buy_in.group(1):
    parsed_hand["metadata"]["buy_in"] = float(buy_in.group(0))
    parsed_hand["metadata"]["rake"] = float(buy_in.group(1))
I am expecting to get 14.69 in .group(0) and 0.31 from .group(1) however I am getting 14.69 twice. Does anyone have any ideas?
Kind regards

There is only 1 capturing group, ([0-9]+.[0-9]+), which matches 1+ digits, then any single character (the unescaped dot), then 1+ digits again.
re.search returns a match object for the first match only, where .group(0) returns the entire match and .group(1) the first capturing group. Because the group spans the whole match, you get 14.69 twice.
You have to escape the dot to match it literally. But as you use a positive lookbehind, you could omit the group and get the match only:
(?<=[$€£])[0-9]+\.[0-9]+
import re
ptrn_buy_in = re.compile(r"(?<=[$€£])[0-9]+\.[0-9]+")
print(re.findall(ptrn_buy_in, r"$14.69+$0.31"))
Result
['14.69', '0.31']
Or use a match with the capturing group:
[$€£]([0-9]+\.[0-9]+)
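To make the difference between re.search and re.findall concrete, here is a minimal sketch using the string from the question. The variable names are illustrative only, and it assumes the line contains exactly two amounts.

```python
import re

line = "$14.69+$0.31"
pattern = re.compile(r"[$€£]([0-9]+\.[0-9]+)")

# search() stops at the first match: group(0) is the whole match
# (including the currency sign), group(1) is the captured number.
first = pattern.search(line)
whole_match = first.group(0)
first_group = first.group(1)

# findall() scans the whole string and, with one capturing group,
# returns the captured text of every match.
amounts = pattern.findall(line)

# Assumption: exactly two amounts (buy-in and rake) are present.
buy_in, rake = (float(a) for a in amounts)
print(whole_match, first_group, amounts, buy_in, rake)
```

Indexing the list returned by findall starts at 0, so the two amounts are amounts[0] and amounts[1].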

Related

Regex unbalanced parenthesis error while finding phone numbers

I am using a regex to find phone numbers and got an "unbalanced parenthesis" error at position 32.
import re
Regex_digit = re.compile(r'((\d\d\d-)? \d\d\d-\d\d\d\d(,))?) {3}')
Regex_digit.search('Hello, you can call me at 144-245-1452,152-632,414-156-3552')
To get the phone numbers in the string:
'Hello, you can call me at 144-245-1452,152-632,414-156-3552'
We expect the output to capture both of the following:
144-245-1452
414-156-3552
Given your regex, a pattern that would work is:
Regex_digit = re.compile(r'(\d\d\d-)?(\d\d\d)-(\d\d\d\d)')
There are 3 capturing groups here:
(\d\d\d-)? - optionally matches 3 digits [0-9] followed by a literal -.
(\d\d\d) - matches exactly 3 digits [0-9].
- - matches the character - literally.
(\d\d\d\d) - matches exactly 4 digits [0-9].
result = Regex_digit.search('Hello, you can call me at 144-245-1452,152-632,414-156-3552')
print(result)
<re.Match object; span=(26, 38), match='144-245-1452'>
However, search only gives you the first match in the string.
To get all matches:
string = 'Hello, you can call me at 144-245-1452,152-632,414-156-3552'
result = Regex_digit.findall(string)
print(result)
Because we have 3 capturing groups, you get a list of tuples, with each tuple containing the 3 captured items of one match.
To get the result back as a list of strings, you can use the join method:
print(["".join(x) for x in re.findall(Regex_digit, string)])
The refactored pattern would be:
regex_digit = re.compile(r'(\d{3}-)?(\d{3})-(\d{4})')
It means the same thing as the pattern discussed above.
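As a possible alternative to joining the tuples, the groups can be dropped or made non-capturing, in which case findall returns the full matches directly. This is a sketch, not part of the original answer, and the name phone_re is only illustrative.

```python
import re

text = 'Hello, you can call me at 144-245-1452,152-632,414-156-3552'

# (?:...) groups without capturing, so findall returns whole matches
# instead of tuples of groups.
phone_re = re.compile(r'(?:\d{3}-)?\d{3}-\d{4}')

numbers = phone_re.findall(text)
print(numbers)
```

The incomplete number 152-632 is skipped because nothing starting there is followed by a hyphen and four digits.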
In your case this error occurs because you have more closing parentheses than opening parentheses. Balanced, it would look like this:
r'((\d\d\d-)? \d\d\d-\d\d\d\d(,)?){3}'
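The reported position can be checked by catching the exception that re.compile raises. The sketch below uses the two patterns from this thread; the variable names are illustrative.

```python
import re

text = 'Hello, you can call me at 144-245-1452,152-632,414-156-3552'
broken = r'((\d\d\d-)? \d\d\d-\d\d\d\d(,))?) {3}'
balanced = r'((\d\d\d-)? \d\d\d-\d\d\d\d(,)?){3}'

try:
    re.compile(broken)
    error_pos = None
except re.error as exc:
    # exc.pos is the index in the pattern where parsing failed.
    error_pos = exc.pos
    print(exc)

# The balanced pattern compiles, which only shows that the syntax is valid.
compiled = re.compile(balanced)
result = compiled.search(text)
print(error_pos, result)
```

Note that balancing the parentheses only fixes the syntax error. The balanced pattern still requires a literal space before each number, so it does not match the sample string.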

python regex return non-capturing group

I want to generate a username from an email with :
firstname's first letter
lastname's first 7 letters
eg :
getUsername("my-firstname.my-lastname#email.com")
mmylastn
Here is getUsername's code :
def getUsername(email):
    return re.match(r"(.){1}[a-z]+.([a-z]{7})", email.replace('-', '')).group()
email.replace('-','') to get rid of the - symbol
regex that captures the 2 groups I described above
If I do .group(1,2) I can see the captured groups are m and mylastn, so it's all good.
But using .group() doesn't just return the capturing groups but also everything between them: myfirstname.mylastn
Can someone explain this behavior?
First of all, a . in a pattern is a metacharacter that matches any char excluding line break chars. You need to escape the . in the regex pattern to match a literal dot.
Also, the {1} limiting quantifier is always redundant; you may safely remove it from any regex you have.
Next, if you need to get a mmylastn string as a result, you cannot use match.group() because .group() fetches the overall match value, not the concatenated capturing group values.
So, in your case:
Check if there is a match first, since calling .groups() on None raises an AttributeError.
Then join the match.groups().
You can use
import re
def getUsername(email):
    m = re.match(r"(.)[a-z]+\.([a-z]{7})", email.replace('-', ''))
    if m:
        return "".join(m.groups())
    return email
print(getUsername("my-firstname.my-lastname#email.com"))
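The difference between .group(), .group(1, 2) and .groups() can be seen side by side in this small sketch, which uses the email from the question with the hyphens already removed. The variable names are illustrative.

```python
import re

m = re.match(r"(.)[a-z]+\.([a-z]{7})", "myfirstname.mylastname#email.com")

# group() with no argument is group(0): the whole matched text.
whole = m.group()

# group(1, 2) and groups() return only the captured parts, as a tuple.
pair = m.group(1, 2)
all_groups = m.groups()

# Joining the captured parts gives the username.
username = "".join(all_groups)
print(whole, pair, username)
```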

Fetching respective group values in a regex expression

I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where none of the above patterns (Zero Rated or something with %) present and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value (100.00 for the second example). This works for lines where Zero Rated is present.
However if I want to also check for pattern where Zero Rated is not present but % is present as in first example above, I was trying to use | but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time I am not getting the right groups.
Sites like regex101.com can help debug regexes.
In this case, the problem is with operator precedence; the | applies to everything on either side of it, so it splits the whole regex into two alternatives. You can group parts of the regex without creating additional capturing groups with (?:...)
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
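The precedence issue can be observed directly by comparing the groups of both patterns on the first example line. This is a small sketch; the variable names are illustrative, and the line is assumed to contain single spaces between fields.

```python
import re

line = "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00"

# Without grouping, | splits the pattern into two whole alternatives:
#   (.*)Zero.*Rated    OR    %\s*(\S*)
# Only the second alternative matches here, so group 1 is None.
ungrouped = re.search(r'(.*)Zero.*Rated|%\s*(\S*)', line).groups()

# With (?:...), the alternation is limited to "Zero.*Rated" or "%",
# and both capturing groups take part in every match.
grouped = re.search(r'(.*)(?:Zero.*Rated|%)\s*(\S*)', line).groups()

print(ungrouped)
print(grouped)
```

The grouped pattern fixes the alternation, but on this line group 2 is still the text right after the % sign rather than the final amount, and group 1 still ends with the percentage digit.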
You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few times as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a digit format of 1-3 digits, optionally repeated by a comma and 3 digits followed by a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
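The question asks for numeric values and for non-matching lines to be skipped. One way to get there, sketched under the assumption that amounts always use a comma as the thousands separator, is to strip the commas before calling float. The helper name parse_fee_line and the non-matching sample line are hypothetical.

```python
import re

FEE_RE = re.compile(
    r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:,\d{3})*\.\d{2})"
    r"\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
)

def parse_fee_line(line):
    """Return (description, amount) or None if the line does not match."""
    m = FEE_RE.search(line.strip())
    if m is None:
        return None
    # Assumption: ',' is only ever a thousands separator.
    return m.group(1), float(m.group(2).replace(",", ""))

lines = [
    "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00",
    "Unpacking/Unremoval fee Zero Rated 100.00",
    "A line without either pattern",  # hypothetical line that is skipped
]

parsed = (parse_fee_line(line) for line in lines)
fees = dict(item for item in parsed if item is not None)
print(fees)
```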

getting a certain group from regex match

I have a .txt file with some text (copied from an EDIFACT file) and I want to match certain fields. I basically just want the date (match 1, group 0).
This is the regex that I have:
https://regex101.com/r/oSVlS8/6
But I can't implement it in my code; I only want group 0 of the match.
here is my code:
regex = r"^((?:INV)\+(?:[^+\n]*\+){4})\d{8}"
with open("test edifakt 1 bk v1.txt", "r") as f:
    result = re.findall(regex, f.read(), re.MULTILINE)
print(result)
and this is what i get as a result:
['INV+ED Format 1+Brustkrebs+19880117+E000000001+']
I actually want "20080702" instead
I tried things like print(result.group(0)), but that didn't work. I got:
AttributeError: 'list' object has no attribute 'group'
I also tried passing it as an argument, like result = re.findall(regex,f.read(),group(0),re.MULTILINE), but I get:
NameError: name 'group' is not defined
Can I really only select a certain group if I'm using re.search on a string?
You could change the capturing group to capture the digits instead.
Note that you can omit the non-capturing group around INV, (?:INV). Also, because * is used as the quantifier in [^+\n]*\+, the pattern could also match 4 consecutive plus signs (++++).
^INV\+(?:[^+\n]*\+){4}(\d{8})
^ Start of string
INV\+ match INV+
(?: Non capturing group
[^+\n]*\+ Match 0+ times any char except a + or newline, then a literal +
){4} Close group and repeat 4 times
(\d{8}) Capture group 1, match 8 digits
For example
regex = r"^INV\+(?:[^+\n]*\+){4}(\d{8})"
result = re.findall(regex, test_str, re.MULTILINE)
print(result)
Output
['20080702']
If you want to use the group method, you could use
matches = re.search(regex, test_str, re.MULTILINE)
if matches:
    print(matches.group(1))
Output
20080702
re.findall will return the value of the capture group.
re.search will return a match object which has a group method.
Try this regex
re.search(r'(?:INV)\+(?:[^+\n]*\+){4}(\d{8})', text).group(1)
Returns
'20080702'
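To see findall and search side by side, here is a self-contained sketch. The sample text is an assumption: it is rebuilt from the output shown in the question plus the expected date, and the surrounding fields are made up, since the original file is not available.

```python
import re

# Hypothetical sample, reconstructed from the question's output.
sample_text = (
    "UNH+1+example\n"
    "INV+ED Format 1+Brustkrebs+19880117+E000000001+20080702+rest\n"
)

regex = r"^INV\+(?:[^+\n]*\+){4}(\d{8})"

# findall returns a list with the text of the single capturing group.
dates = re.findall(regex, sample_text, re.MULTILINE)

# search returns a match object (or None): group(0) is the whole match,
# group(1) is the capturing group.
m = re.search(regex, sample_text, re.MULTILINE)
whole = m.group(0) if m else None
date = m.group(1) if m else None

print(dates, date)
```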

python regExp search with lookarounds

In my test program I get an input that goes like
str = "TestID277RStep01CtrAx-mn00112345"
Here, I want to use regExp to form groups that return me the following
str = "Test(ID277)(R)(Step01)(CtrAx-mn001)12345"
My goal is to end up with 4 vars
var1 = "ID277"
var2 = "R"
var3 = "Step01"
var4 = "CtrAx-mn001"
I have so far tried
regx = ".*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(Ctr(?=[A-Z][a-z]-/d{3}))?.*"
re_testInp = re.compile ( regx, re.IGNORECASE )
srch = re_testInp.search( r'^' + str )
print srch.groups()
I seem to be getting the first 3 groups right but unable to get the last one.
Almost close to pulling all my hair out with this one. Any help will be much appreciated.
Works fine for me with Python 3.6.0 and the following pattern:
.*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(.*\-(?=[A-Za-z][a-z]\d{3})[A-Za-z][a-z]\d{3})?.*
I only changed the last capturing group as I'll explain what was wrong, in my opinion, with the pattern you included:
.*Test(ID[0-9]+)([RP]?)(Step(?=\d)\d+)?(Ctr(?=[A-Z][a-z]/d{3}))?.*
Note that the last capture group, (Ctr(?=[A-Z][a-z]/d{3})), will not find a match because:
You attempt to match a literal 'Ctr' but do not account for the text that follows it, including the literal '-'. I do not know exactly what text you are trying to match there, so I generalized it to: .*-
You wrote /d{3} instead of \d{3}
In the test string you included, '...CtrAx-mn...', the m is lowercase. You should change the pattern to: (Ctr(?=[A-Za-z][a-z]/d{3})) if you want to support lowercase as well.
You do not use the lookahead assertion properly. As stated in: https://docs.python.org/3/library/re.html
(?=...)
Matches if ... matches next, but doesn’t consume any of the string.
This is called a lookahead assertion. For example, Isaac (?=Asimov)
will match 'Isaac ' only if it’s followed by 'Asimov'.
Meaning you should change the capturing group to: (.*-(?=[A-Za-z][a-z]\d{3})[A-Za-z][a-z]\d{3})
In: (Step(?=\d)\d+) I assume you thought the first digit would be captured in the lookahead assertion, but both digits are captured by the following \d+
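A short sketch of what the lookahead does and does not do, using the test string from the question. The final pattern is a hypothetical simplification without lookaheads; it assumes the fourth field always looks like Ctr, letters, a hyphen, lowercase letters and three digits, which may not hold for other inputs.

```python
import re

text = "TestID277RStep01CtrAx-mn00112345"

# Example from the re documentation: the lookahead checks for 'Asimov'
# but does not consume it, so only 'Isaac ' is part of the match.
isaac = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group()

# Here the lookahead is redundant: \d+ already requires a digit next.
with_lookahead = re.search(r"Step(?=\d)\d+", text).group()
without_lookahead = re.search(r"Step\d+", text).group()

# Hypothetical simplified pattern (an assumption about the field format).
simple = re.compile(r"Test(ID\d+)([RP]?)(Step\d+)?(Ctr[A-Za-z]+-[a-z]+\d{3})?")
var1, var2, var3, var4 = simple.search(text).groups()

print(repr(isaac), with_lookahead, var1, var2, var3, var4)
```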
Ben.
