Extract age from a string-python - python

Consider this string:
s="""A25-54 plus affinities targeting,Demo (AA F21-54),
A25-49 Artist Affinity Targeting,M21-49 plus,plus plus A 21+ targeting"""
I am looking to fix my pattern which currently does not pull all the age groups in the string (A 21+ is missing from the current output).
Current try:
import re
re.findall(r'(?:A|A |AA F|M)(\d+-\d+)',s)
Output:
['25-54', '21-54', '25-49', '21-49'] # does not capture the last group, A 21+
Expected Output:
['A25-54','AA F21-54','A25-49','M21-49','A 21+']
As you see, I would like to have the last group too which is A 21+ which is currently missing from my output.
I would also like to keep the text that precedes each age group. Apart from missing one group, my current output drops the prefix before the age range, e.g. I want 'A25-54' instead of '25-54', I guess because of ?: .
Appreciate any help I can get.

The prefixes are missing because your pattern contains one capturing group, and once there is a capturing group in the regex, re.findall returns only the text captured by that group rather than the whole match. The second issue is that the last group is never matched: after the first one or more digits, you should match either - followed by one or more digits, or a literal + symbol.
You may use
(?:A|A |AA F|M)\d+(?:-\d+|\+)
NOTE: You might want to add a word boundary at the start to only match those A, AA F, etc. as whole words: r'\b(?:A|A |AA F|M)\d+(?:-\d+|\+)'.
Details
(?:A|A |AA F|M) - a non-capturing group matching A, A followed by a space, AA F or M
\d+ - 1+ digits
(?:-\d+|\+) - a non-capturing group matching - and 1+ digits after it or a single + symbol.
Python demo:
import re
s="""A25-54 plus affinities targeting,Demo (AA F21-54),
A25-49 Artist Affinity Targeting,M21-49 plus,plus plus A 21+ targeting"""
print(re.findall(r'(?:A|A |AA F|M)\d+(?:-\d+|\+)',s))
# => ['A25-54', 'AA F21-54', 'A25-49', 'M21-49', 'A 21+']
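The effect of the word boundary from the note above can be checked with a small sketch. The second test string, NASA25-30, is made up for illustration and does not come from the question.

```python
import re

s = """A25-54 plus affinities targeting,Demo (AA F21-54),
A25-49 Artist Affinity Targeting,M21-49 plus,plus plus A 21+ targeting"""

plain = re.compile(r'(?:A|A |AA F|M)\d+(?:-\d+|\+)')
bounded = re.compile(r'\b(?:A|A |AA F|M)\d+(?:-\d+|\+)')

# On the sample text both patterns find the same five items.
same_on_sample = plain.findall(s) == bounded.findall(s)

# Hypothetical input where the prefix letter sits inside a longer word.
inside_word = 'NASA25-30'
plain_hits = plain.findall(inside_word)      # the trailing "A" of NASA is picked up
bounded_hits = bounded.findall(inside_word)  # \b rejects an "A" preceded by a word char

print(same_on_sample, plain_hits, bounded_hits)
# => True ['A25-30'] []
```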

Related

Regex to Match Pattern 5ABXYXYXY

I am working on mobile number of 9 digits.
I want to use regex to match numbers with pattern 5ABXYXYXY.
A sample I have is 529434343
What I have tried
I have the below pattern to match it.
r"^\d*(\d)(\d)(?:\1\2){2}\d*$"
However, this pattern matches another pattern I have which is 5XXXXXXAB
a sample for that is 555555532.
What I want
I want to edit my regex to match only the first pattern, 5ABXYXYXY, and ignore the second one, 5XXXXXXAB.
You can use
^\d*((\d)(?!\2)\d)\1{2}$
See the regex demo.
Details:
^ - start of string
\d* - zero or more digits
((\d)(?!\2)\d) - Group 1: a digit (captured into Group 2), then another digit (not the same as the preceding one)
\1{2} - two occurrences of Group 1 value
$ - end of string.
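A quick way to compare the pattern from the question with the one above, using only the two sample numbers from the question (the variable names are arbitrary):

```python
import re

original = re.compile(r'^\d*(\d)(\d)(?:\1\2){2}\d*$')  # pattern from the question
fixed = re.compile(r'^\d*((\d)(?!\2)\d)\1{2}$')        # pattern from this answer

for number in ('529434343', '555555532'):
    print(number, bool(original.match(number)), bool(fixed.match(number)))
# => 529434343 True True
# => 555555532 True False
```

The original pattern accepts 555555532 because 55 repeated three times satisfies (\d)(\d)(?:\1\2){2} and the trailing \d* absorbs 532. The fixed pattern requires the two digits of the pair to differ and anchors the repeated pair at the end of the string.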
To match 5ABXYXYXY where AB should not be same as XY matching 3 times, you may use this regex:
^\d*(\d{2})(?!\1)((\d)(?!\3)\d)\2{2}$
RegEx Breakup:
^: Start
\d*: Match 0 or more digits
(\d{2}): Match 2 digits and capture in group #1
(?!\1): Make sure we don't have same 2 digits at next position
(: Start capture group #2
(\d): Match and capture a digit in capture group #3
(?!\3): Make sure we don't have same digit at next position as in 3rd capture group
\d: Match a digit
): End capture group #2
\2{2}: Match 2 pairs of same value as in capture group #2
$: End
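The stricter pattern can be exercised as follows. The number 543434343 is a made-up example (not from the question) in which AB equals XY, so it should be rejected.

```python
import re

strict = re.compile(r'^\d*(\d{2})(?!\1)((\d)(?!\3)\d)\2{2}$')

samples = {
    '529434343': '5ABXYXYXY with AB != XY',
    '543434343': 'hypothetical: AB is the same pair as XY',
    '555555532': '5XXXXXXAB',
}

results = {number: bool(strict.match(number)) for number in samples}
print(results)
# => {'529434343': True, '543434343': False, '555555532': False}
```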

Fetching respective group values in a regex expression

I have an example string like below:
Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00
I can have another example string that can be like:
Unpacking/Unremoval fee Zero Rated 100.00
I am trying to access the first set of words and the last number values.
So I want the dict to be
{'Handling - Uncrating of 3 crates - USD600 each':1800.00}
or
{'Unpacking/Unremoval fee':100.00}
There might be strings where none of the above patterns (Zero Rated or something with %) present and I would skip those strings.
To do that, I was regexing the following pattern
pattern = re.search(r'(.*)Zero.*Rated\s*(\S*)',line.strip())
and then
pattern.group(1)
gives the keys for dict and
pattern.group(2)
gives the value of 1800.00. This works for lines where Zero Rated is present.
However if I want to also check for pattern where Zero Rated is not present but % is present as in first example above, I was trying to use | but it didn't work.
pattern = re.search(r'(.*)Zero.*Rated|%\s*(\S*)',line.strip())
But this time I do not get the right values in the groups.
Sites like regex101.com can help debug regexes.
In this case, the problem is with operator precedence; the | operates over the whole of the rest of the regex. You can group parts of the regex without creating additional groups with (?: )
Try: r'(.*)(?:Zero.*Rated|%)\s*(\S*)'
Definitely give regex101.com a go, though, it'll show you what's going on in the regex.
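To make the precedence point concrete, here is a small sketch that runs both patterns on the two sample lines from the question and prints the captured groups. The variable names are arbitrary.

```python
import re

zero_line = "Unpacking/Unremoval fee Zero Rated 100.00"
pct_line = "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00"

# Without grouping, | splits the WHOLE pattern into two alternatives:
#   (.*)Zero.*Rated    or    %\s*(\S*)
ungrouped = re.compile(r'(.*)Zero.*Rated|%\s*(\S*)')
# With (?: ), the alternation only covers "Zero.*Rated" versus "%".
grouped = re.compile(r'(.*)(?:Zero.*Rated|%)\s*(\S*)')

for name, rx in (('ungrouped', ungrouped), ('grouped', grouped)):
    for line in (zero_line, pct_line):
        print(name, rx.search(line).groups())
# => ungrouped ('Unpacking/Unremoval fee ', None)
# => ungrouped (None, '=126.00')
# => grouped ('Unpacking/Unremoval fee ', '100.00')
# => grouped ('Handling - Uncrating of 3 crates - USD600 each 7', '=126.00')
```

Note that for the % line the grouped pattern fills both groups, but group 2 is the text directly after % rather than the final amount, so a more specific pattern is still needed for that case.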
You might use
^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})
The pattern matches
^ Start of string
(.+?) Capture group 1, match any char except a newline, as few as possible
\s* Match 0+ whitespace chars
(?: Non capture group
Zero Rated Match literally
| Or
\d+%= Match 1+ digits and %=
\d{1,3}(?:\,\d{3})*\.\d{2} Match a number format of 1-3 digits, optionally followed by repeated groups of a comma and 3 digits, then a dot and 2 digits
) Close non capture group
\s* Match 0+ whitespace chars
(\d{1,3}(?:,\d{3})*\.\d{2}) Capture group 2, match the digit format
For example
import re
regex = r"^(.+?)\s*(?:Zero Rated|\d+%=\d{1,3}(?:\,\d{3})*\.\d{2})\s*(\d{1,3}(?:,\d{3})*\.\d{2})"
test_str = ("Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00\n"
"Unpacking/Unremoval fee Zero Rated 100.00\n"
"Delivery Cartage - IT Equipment, up to 1000kgs - 7%=210.00 3,000.00")
print(dict(re.findall(regex, test_str, re.MULTILINE)))
Output
{'Handling - Uncrating of 3 crates - USD600 each': '1,800.00', 'Unpacking/Unremoval fee': '100.00', 'Delivery Cartage - IT Equipment, up to 1000kgs -': '3,000.00'}
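The question asks for numeric values (1800.00 rather than '1,800.00') and for non-matching lines to be skipped. A minimal sketch of that last step, assuming the amounts always use commas as thousands separators; parse_line and the constant names are hypothetical helpers, not part of the answer above:

```python
import re

AMOUNT = r"\d{1,3}(?:,\d{3})*\.\d{2}"
# Same pattern as above, assembled from the shared AMOUNT fragment.
LINE_RE = re.compile(
    r"^(.+?)\s*(?:Zero Rated|\d+%=" + AMOUNT + r")\s*(" + AMOUNT + r")"
)

def parse_line(line):
    """Return (description, amount as float), or None if the line does not match."""
    m = LINE_RE.search(line.strip())
    if m is None:
        return None
    return m.group(1), float(m.group(2).replace(",", ""))

lines = [
    "Handling - Uncrating of 3 crates - USD600 each 7%=126.00 1,800.00",
    "Unpacking/Unremoval fee Zero Rated 100.00",
    "A line without either pattern",  # made-up line that should be skipped
]

result = dict(parsed for parsed in map(parse_line, lines) if parsed is not None)
print(result)
# => {'Handling - Uncrating of 3 crates - USD600 each': 1800.0, 'Unpacking/Unremoval fee': 100.0}
```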

Key error when using regex quantifier python

I am trying to capture words following specified stocks in a pandas df. I have several stocks in the format $IBM and am setting a python regex pattern to search each tweet for 3-5 words following the stock if found.
My df called stock_news looks as such:
Word Count
0 $IBM 10
1 $GOOGL 8
etc
pattern = ''
for word in stock_news.Word:
    pattern += '{} (\w+\s*\S*){3,5}|'.format(re.escape(word))
However my understanding is that {} should be a quantifier, in my case matching between 3 to 5 times however I receive the following KeyError:
KeyError: '3,5'
I have also tried using rawstrings with r'{} (\w+\s*\S*){3,5}|' but to no avail. I also tried using this pattern on regex101 and it seems to work there but not in my Pycharm IDE. Any help would be appreciated.
Code for finding:
pat = re.compile(pattern, re.I)
for i in tweet_df.Tweets:
    for x in pat.findall(i):
        print(x)
When you build your pattern this way, an empty alternative is left at the end (after the trailing |), so the pattern can match the empty string at every position where none of the stock alternatives matches.
You need to build the pattern like
(?:\$IBM|\$GOOGLE)\s+(\w+(?:\s+\S+){3,5})
You may use
pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(
"|".join(map(re.escape, stock_news['Word'])))
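Mind that the literal curly braces inside an f-string or a format string must be doubled. Otherwise str.format treats {3,5} as a replacement field named 3,5, which is where the KeyError: '3,5' comes from.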
Regex details
(?:\$IBM|\$GOOGLE) - a non-capturing group matching either $IBM or $GOOGLE
\s+ - 1+ whitespaces
(\w+(?:\s+\S+){3,5}) - Capturing group 1 (when using str.findall, only this part will be returned):
\w+ - 1+ word chars
(?:\s+\S+){3,5} - a non-capturing* group matching three, four or five occurrences of 1+ whitespaces followed with 1+ non-whitespace characters
Note that non-capturing groups are meant to group some patterns, or quantify them, without actually allocating any memory buffer for the values they match, so that you could capture only what you need to return/keep.
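A self-contained sketch of both points, without pandas: a plain list stands in for stock_news['Word'], and the tweet text is invented for illustration.

```python
import re

words = ['$IBM', '$GOOGL']  # stand-in for stock_news['Word']

# Single braces: str.format reads {3,5} as a field named "3,5" and fails.
try:
    r'{} (\w+\s*\S*){3,5}|'.format('x')
    error_key = None
except KeyError as exc:
    error_key = exc.args[0]

# Doubled braces survive formatting as a literal {3,5} quantifier.
pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(
    '|'.join(map(re.escape, words)))
pat = re.compile(pattern, re.I)

tweet = 'Buying $IBM shares today because earnings look strong this quarter'
found = pat.findall(tweet)

print(error_key)  # => 3,5
print(pattern)    # => (?:\$IBM|\$GOOGL)\s+(\w+(?:\s+\S+){3,5})
print(found)      # => ['shares today because earnings look strong']
```

Joining the escaped words with | inside one group also avoids the trailing empty alternative that the loop in the question produces.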

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

I'm newer to more advanced regex concepts and am starting to look into look behinds and lookaheads but I'm getting confused and need some guidance. I have a scenario in which I may have several different kind of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one line regex that can find match groups in both types. For example if file type is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
or if the second zip one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me as I would assume that the regex would need to assert if the hyphen exists and if does not, only look for the one match group, if not find the other 3.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote, which successfully matches the first type but does not have the conditional. I was thinking about doing something like matching a group of non-digits after the hyphen, but I can't quite get it to work, and I do not know how to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen.
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right direction?
You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
# raw string, so that \d and \. reach the regex engine unchanged
final_results = [re.findall(r'[a-zA-Z]{1}[\d\.]+|(?<=\-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
    print(i)
    print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2.
----------
Note that the result garnered from re.findall is:
[['v1.1.2', 'beta', '2'], ['v1.1.2.']]
Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
Output:
v1.1.2
beta
2
v1.1.2
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.
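One detail worth seeing in code: when the optional part is absent, groups 2 and 3 are None rather than missing. A minimal sketch built on the regex above; split_release is a hypothetical helper name and notes.txt is a made-up non-matching input.

```python
import re

VERSION_RE = re.compile(r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip")

def split_release(name):
    """Return the groups that took part in the match, or None if there is no match."""
    m = VERSION_RE.search(name)
    if m is None:
        return None
    # Optional groups that did not participate are reported as None; drop them.
    return [g for g in m.groups() if g is not None]

print(VERSION_RE.search("v1.1.2.zip").groups())  # => ('v1.1.2', None, None)
print(split_release("v1.1.2-beta.2.zip"))        # => ['v1.1.2', 'beta', '2']
print(split_release("v1.1.2.zip"))               # => ['v1.1.2']
print(split_release("notes.txt"))                # => None
```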
We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This thus produces three groups: the first one the version, the second is optional: a dash - followed by alphabetical characters, and then an optional sequence of dots followed by numbers, and finally .zip.
If we ignore the \.zip suffix (well I assume this is rather trivial), then there are still three groups:
(v\d+(?:\.\d+)*): a capturing group that starts with a v followed by \d+ (one or more digits). Then we have a non-capturing group (a group starting with (?:) that matches \.\d+, a dot followed by a sequence of one or more digits. This subgroup is repeated zero or more times.
(?:[-]([A-Za-z]+))?: a non-capturing group that matches a hyphen [-] and then captures one or more [A-Za-z] characters. The whole group is optional (the ? at the end).
((?:\.\d+)*): a group that again has such \.\d+ non-capture subgroup, so we capture a dot followed by a sequence of digits, and this pattern is repeated zero or more times.
For example:
import re
rgx = re.compile(r'(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', 'beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]

Explain the behavior of this re

I have the following:
>>> re.sub('(..)+?/story','\\g<1>','money/story')
'mey'
>>>
Why is capture group 1 the first letter and last two letters of money and not the first two letters?
The first capture group does not contain m at all. What is being matched by (..)+?/story is oney/story.
The (..)+? matches an even number of characters, so the following is matched (spaced out to make it clearer):
m o n e y / s t o r y
  ^-^ ^-^
Then the replacement is the first capture group. Something you might not know is that when you have a repeated capture group (in this case (..)+?), then only the last captured group is kept.
To summarise, oney/story is matched, and replaced with ey, so the result is mey.
Because the string money contains 5 letters (an odd number), the match cannot start at the first letter m. (..)+? captures two characters and non-greedily repeats that one or more times. Because the + quantifier is applied to the capturing group itself, the group keeps only the last two characters matched by (..)+?. So the captured string is ey, not the first pair. Replacing all the matched characters with the contents of group 1, ey, therefore gives you mey.
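The behaviour described in both answers can be verified directly. The second pattern below, which wraps the repetition in an outer capturing group, is an illustrative variation and not something the question asked for.

```python
import re

text = 'money/story'

m = re.search(r'(..)+?/story', text)
whole, last_pair = m.group(0), m.group(1)
print(whole)      # => oney/story
print(last_pair)  # => ey  (a repeated group keeps only its last iteration)

# To keep everything the repetition matched, capture the repetition itself.
m_all = re.search(r'((?:..)+?)/story', text)
all_pairs = m_all.group(1)
print(all_pairs)  # => oney

replaced = re.sub(r'(..)+?/story', r'\g<1>', text)
print(replaced)   # => mey
```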
