Regex python findall issue - python

From the test string:
test=text-AB123-12a
test=text-AB123a
I have to extract only 'AB123-12' and 'AB123', but:
re.findall("[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?", test)
returns:
['', '', '', '', '', '', '', 'AB123-12a', '']
What are all these extra empty spaces? How do I remove them?

The quantifier {0,n} matches anywhere from 0 to n occurrences of the preceding pattern. Since the first two parts of your pattern allow 0 occurrences, and the third is optional (?), the whole pattern can match a zero-length string, so it produces an empty match at every position in your string where nothing longer matches.
Editing to find a minimum of one and maximum of 9 and 5 for each pattern yields correct results:
>>> test='text-AB123-12a'
>>> import re
>>> re.findall("[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a']
Without further detail about what exactly the strings you are matching look like, I can't give a better answer.

Your pattern can match zero-length strings because the lower limits of your quantifiers are set to 0. Simply setting them to 1 will produce the results you want:
>>> import re
>>> test = ''' test=text-AB123-12a
... test=text-AB123a'''
>>> re.findall("[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a', 'AB123']

RegEx tester: http://www.regexpal.com/ says that your pattern string [A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)? can match 0 characters, and therefore matches at every position in the string.
Check your expression one more time. The empty strings in Python's result are those zero-length matches.

Since all parts of your pattern are optional (your ranges specify zero to N occurrences and you are qualifying the group with ?), each position in the string counts as a match and most of those are empty matches.
How to prevent this from happening depends on the exact format of what you are trying to match. Are all those parts of your match really optional?
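To see where the empty strings come from, a small sketch can list the span of every match. It assumes the single test string 'text-AB123-12a' and uses re.finditer instead of re.findall so that positions are visible; the variable names are only for illustration.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r"[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?")
test = "text-AB123-12a"

# Collect (span, matched text) for every match, including zero-length ones.
all_matches = [(m.span(), m.group()) for m in pattern.finditer(test)]
for span, text in all_matches:
    print(span, repr(text))

# Zero-length matches have start == end; filtering them leaves the real hit.
non_empty = [text for _, text in all_matches if text]
print(non_empty)
```

Filtering out the empty strings afterwards works, but tightening the quantifiers so that the pattern cannot match nothing is usually the cleaner fix.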

Since letters or digits are optional at the beginning, you must ensure that there's at least one letter or one digit, otherwise your pattern will match the empty string at each position in the string. You can do it starting your pattern with a lookahead. Example:
re.findall(r'(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)', test)
In this way the match can start with a letter or with a digit.
I assume that when there's a hyphen, it is followed by at least one digit (otherwise what is the reason for this hyphen?). In other words, I assume that -a isn't possible at the end. (Correct me if I'm wrong.)
To exclude the "a" from the match result, I put it in a lookahead.
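As a quick check of the lookahead approach, here is a minimal sketch that runs the pattern over both lines of the question's test data. It assumes the two lines are joined by a newline in one string.

```python
import re

# Lookahead-based pattern from the answer above.
pattern = re.compile(r"(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)")

# Assumption: both sample lines live in one multi-line string.
test = "test=text-AB123-12a\ntest=text-AB123a"

result = pattern.findall(test)
print(result)
```

The leading lookahead rejects every position that does not start with an uppercase letter or a digit, so no empty matches are produced, and the trailing lookahead keeps the "a" out of the result.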

Related

Parsing based on pattern not at the beginning

I want to extract the number before "2022" in a set of strings. I currently do
a= mystring.strip().split("2022")[0]
and, for instance, when mystring=' 1020220519AX', this gives a = '10'. However,
mystring.strip().split("2022")[0]
fails to return a='202' when mystring=' 20220220519AX' (it returns an empty string instead). Therefore, I want the code to split the string on a "2022" that is not at the beginning of the non-whitespace characters in the string.
Can you please guide with this?
Use a regular expression rather than split().
import re
mystring = ' 20220220519AX'
match = re.search(r'^\s*(\d+?)2022', mystring)
if match:
    print(match.group(1))
^\s* skips over the whitespace at the beginning, then (\d+?) captures the following digits up to the first 2022.
You can tell a regex engine that you want all the digits before 2022:
r'\d+(?=2022)'
Like .split(), a regex engine is 'greedy' by default - 'greedy' here means that as soon as it can take something that it is instructed to take, it will take that and it won't try another option, unless the rest of the expression cannot be made to work.
So, in your case, mystring.strip().split("2022") splits on the first 2022 it can find, and since there's nothing stopping it, that is the result you have to work with.
Using regex, you can even tell it you're not interested in the 2022, but in the numbers before it: the \d+ will match as long a string of digits it can find (greedy), but the (?=2022) part says it must be followed by a literal 2022 to be a match (and that won't be part of the match, a 'positive lookahead').
Using something like:
import re
mystring = ' 20220220519AX'
print(re.findall(r'\d+(?=2022)', mystring))
Will show you all consecutive matches.
Note that for a string like ' 920220220519AX 12022', it will find ['9202', '1'] and only that - it won't find all possible combinations of matches. The first, greedy pass through the string that succeeds is the answer you get.
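The difference between the greedy \d+ and the lazy \d+? used in the other answer can be sketched on that same sample string. The variable names here are only for illustration.

```python
import re

sample = " 920220220519AX 12022"

# Greedy: take as many digits as possible, then back off until 2022 follows.
greedy_matches = re.findall(r"\d+(?=2022)", sample)
print(greedy_matches)

# Lazy: take as few digits as possible before the first 2022 that follows.
lazy_first = re.search(r"\d+?(?=2022)", sample)
print(lazy_first.group())
```

Which of the two is right depends on whether the number you want ends at the first or at the last 2022 inside the run of digits.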
You could split() asserting not the start of the string to the left after using strip(), or you can get the first occurrence of 1 or more digits from the start of the string, in case there are more occurrences of 2022:
import re
strings = [
    ' 1020220519AX',
    ' 20220220519AX'
]
for s in strings:
    parts = re.split(r"(?<!^)2022", s.strip())
    if parts:
        print(parts[0])
for s in strings:
    m = re.match(r"\s*(\d+?)2022", s)
    if m:
        print(m.group(1))
Both will output
10
202
Note that the split variant does not guarantee that the first part consists of digits; it only splits the string.
If the string consists of only word characters, splitting on \B2022, where \B means not a word boundary, will also prevent splitting at the start of the example string.
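The \B variant can be sketched as follows, with the plain str.split() result next to it for comparison. It assumes the same two sample strings as above.

```python
import re

strings = [" 1020220519AX", " 20220220519AX"]

# \B2022 only matches a 2022 that is preceded by another word character,
# so a 2022 at the very start of the stripped string is not a split point.
regex_parts = [re.split(r"\B2022", s.strip())[0] for s in strings]
print(regex_parts)

# Plain split() for comparison: the second string splits at position 0.
plain_parts = [s.strip().split("2022")[0] for s in strings]
print(plain_parts)
```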

unexpected re.sub behavior

I defined
import re
s='f(x) has an occ of x but no y'
def italicize_math(line):
    p=r"(\W|^)(x|y|z|f|g|h)(\W|$)"
    repl=r"\1<i>\2</i>\3"
    return re.sub(p,repl,line)
and made the following call:
print(italicize_math(s))
The result is
'<i>f</i>(x) has an occ of <i>x</i> but no <i>y</i>'
which is not what I expected. I wanted this instead:
'<i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>'
Can anyone tell me why the first occurrence of x was not enclosed in the "i" tags?
You seem to be trying to match non-alphanumeric characters (\W) when you really want a word boundary (\b):
>>> p=r"(\b)(x|y|z|f|g|h)(\b)"
>>> re.sub(p,repl,s)
'<i>f</i>(<i>x</i>) has an occ of <i>x</i> but no <i>y</i>'
Of course, ( is non alpha-numeric -- The reason your inner content doesn't match is because \W consumes a character in the match. so with a string like 'f(x)', you match the ( when you match f. Since ( was already matched, it won't match again when you try to match x. By contrast, word boundaries don't consume any characters.
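The consuming behaviour can be made visible by printing the span of each match. This is a small sketch that compares the question's pattern with a word-boundary pattern on the same input; the variable names are illustrative.

```python
import re

s = "f(x) has an occ of x but no y"

# Original pattern: \W consumes the character on each side of the letter.
consuming = re.compile(r"(\W|^)(x|y|z|f|g|h)(\W|$)")
consuming_matches = [(m.span(), m.group()) for m in consuming.finditer(s)]
print(consuming_matches)

# Word boundaries are zero-width, so neighbouring characters stay available.
boundary = re.compile(r"\b([fghxyz])\b")
boundary_spans = [m.span() for m in boundary.finditer(s)]
print(boundary_spans)
```

The first match of the consuming pattern covers both f and (, so the search resumes at x with no non-word character left in front of it.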
The x is skipped because the match for f has already consumed the ( (through the third group), so a match for x would have to overlap the previous match. Also, the first and third groups are redundant since they can be replaced by word boundaries, and you can use a character class to combine the letters.
p = r'\b([fghxyz])\b'
repl = r'<i>\1</i>'
As the previous answers mention, it's because the ( character is consumed when matching f, which causes the subsequent x to fail to match.
Besides replacing with the word boundary \b, you could also use a lookahead, which just peeks and doesn't consume anything matched inside it. Since it doesn't consume anything, you don't need the \3 either:
p=r"(\W|^)(x|y|z|f|g|h)(?=\W|$)"
repl=r"\1<i>\2</i>"
re.sub(p,repl,line)

Python regex: greedy pattern returning multiple empty matches

This pattern is meant simply to grab everything in a string up until the first potential sentence boundary in the data:
[^\.?!\r\n]*
Output:
>>> pattern = re.compile(r"([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!") # Actual source snippet, not a personal comment about Australians. :-)
>>> print(matches)
['Australians go hard', '', '', '', '']
From the Python documentation:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Now, if the string is scanned left to right and the * operator is greedy, it makes perfect sense that the first match returned is the whole string up to the exclamation marks. However, after that portion has been consumed, I do not see how the pattern is producing an empty match exactly four times, presumably by scanning the string leftward after the "d". I do understand that the * operator means this pattern can match the empty string; I just don't see how it would be doing that more than once between the trailing "d" of the letters and the leading "!" of the punctuation.
Adding the ^ anchor has this effect:
>>> pattern = re.compile(r"^([^\.?!\r\n]*)")
>>> matches = pattern.findall("Australians go hard!!!")
>>> print(matches)
['Australians go hard']
Since this eliminates the empty string matches, it would seem to indicate that said empty matches were occurring before the leading "A" of the string. But that would seem to contradict the documentation with respect to the matches being returned in the order found (matches before the leading "A" should have been first) and, again, exactly four empty matches baffles me.
The * quantifier allows the pattern to capture a substring of length zero. In your original code version (without the ^ anchor in front), the additional matches are:
the zero-length string between the end of hard and the first !
the zero-length string between the first and second !
the zero-length string between the second and third !
the zero-length string between the third ! and the end of the text
You can slice/dice this further if you like here.
Adding that ^ anchor to the front now ensures that only a single substring can match the pattern, since the beginning of the input text occurs exactly once.
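The four zero-length matches listed above can be confirmed by printing the span of each match. This is a minimal sketch using re.finditer on the string from the question; the capturing group is dropped because it is not needed for the spans.

```python
import re

text = "Australians go hard!!!"
pattern = re.compile(r"[^.?!\r\n]*")

# Each span is (start, end); zero-length matches have start == end.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)

# If only the leading sentence is wanted, re.match looks at the start only.
first = pattern.match(text).group()
print(repr(first))
```

The empty matches sit at positions 19 through 22, that is, after "hard" and around each "!", not before the leading "A".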

Exclude matched string python re.findall

I am using Python's re.findall method to find occurrences of certain string values in an input string.
e.g. when searching in the string 'ABCdef', I have two search requirements.
Find a string starting with a single capital letter.
After 1, find a string that consists of all capital letters.
e.g. input string and expected output will be:
'USA' -- output: ['USA']
'BObama' -- output: ['B', 'Obama']
'Institute20CSE' -- output: ['Institute', '20', 'CSE']
So my expectation from
>>> matched_value_list = re.findall ( '[A-Z][a-z]+|[A-Z]+' , 'ABCdef' )
is to return ['AB', 'Cdef'].
But that does not seem to be happening. What I get is ['ABC'] as the return value, because the latter part of the regex matches the leading capitals.
So is there any way to ignore matches that have already been found, so that once 'Cdef' is matched with '[A-Z][a-z]+', the second part of the regex (i.e. '[A-Z]+') only matches the remaining string 'AB'?
First you need to match AB, which is followed by an uppercase letter and then a lowercase letter, or is at the end of the string. For that you can use a lookahead.
Then you need to match an uppercase letter C, followed by multiple lowercase letters def.
So, you can use this pattern:
>>> s = "ABCdef"
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", s)
['AB', 'Cdef']
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", 'MumABXYZCdefXYZAbc')
['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']
As pointed out in a comment by @sotapme, you can also modify the above regex to:
"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)"
\d+ is added since you also want to match digits, as in one of your examples. Also, he removed the [a-z] part from the first alternative of the lookahead. That works because the + quantifier on the [A-Z] outside is greedy by default, so it will automatically match the longest possible string and will stop only before the last uppercase letter.
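A short sketch can check the modified regex against the examples from the question. The outer capturing group is omitted here because findall returns the whole match when there are no groups; the dictionary of expected outputs is taken from the question.

```python
import re

# Modified pattern from the comment, without the outer group.
pattern = re.compile(r"[A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+")

# Inputs and expected outputs as listed in the question.
expected = {
    "USA": ["USA"],
    "BObama": ["B", "Obama"],
    "Institute20CSE": ["Institute", "20", "CSE"],
    "ABCdef": ["AB", "Cdef"],
}

results = {text: pattern.findall(text) for text in expected}
for text, found in results.items():
    print(text, found)
```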
You can use this regex
[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)

python regular expression numbers in a row

I'm trying to check a string for a maximum of 3 numbers in a row for which I used:
regex = re.compile("\d{0,3}")
but this does not work: for instance, the string 1234 would be accepted by this regex even though the digit string is over length 3.
If you want to check that a string has a maximum of 3 digits in a row, you need to use '\d{4,}', as you are only interested in digit strings over a length of 3.
import re
s = '123abc1234def12'
print(re.findall(r'\d{4,}', s))
# prints ['1234']
If you use {0,3}:
s = '123456'
print(re.findall(r'\d{0,3}', s))
# prints ['123', '456', '']
The regex matches digit strings of maximum length 3 and empty strings, but this cannot be used to test correctness. You can't use it to check whether all digit strings are within the length limit, but you can easily check for digit strings over the limit.
So to test do something like this:
s = '1234'
# re.search finds a long run anywhere; re.match would only look at the start
if re.search(r'\d{4,}', s):
    print('Max digit string too long!')
# prints Max digit string too long!
\d{0,3} can match the empty string, so it matches every possible string. It's not clear what you mean by "doesn't work", but if you expect to match a string with digits, increase the lower bound of the repetition operator, i.e. {1,3}.
If you wish to exclude runs of 4 or more, try something like (?:^|\D)\d{1,3}(?:\D|$) and of course, if you want to capture the match, use capturing parentheses around \d{1,3}.
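As a possible variation on the pattern above, here is a sketch that uses lookarounds instead of (?:^|\D) and (?:\D|$), so that the neighbouring characters are not consumed. This is a swapped-in technique, not the answer's own pattern, and the helper name is hypothetical.

```python
import re

def has_at_most_three_digits_in_a_row(s):
    """Hypothetical helper: True when no run of 4 or more digits occurs."""
    return re.search(r"\d{4,}", s) is None

sample = "123abc1234def12"

# Runs of 1 to 3 digits that are not part of a longer run of digits.
short_runs = re.findall(r"(?<!\d)\d{1,3}(?!\d)", sample)
print(short_runs)
print(has_at_most_three_digits_in_a_row(sample))
print(has_at_most_three_digits_in_a_row("ab12cd345"))
```

Because (?<!\d) and (?!\d) are zero-width, two short runs separated by a single non-digit character are both found.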
The method you have used finds substrings of 0-3 digits, so it can't meet your expectation.
My solution:
>>> import re
>>> re.findall('\d','ds1hg2jh4jh5')
['1', '2', '4', '5']
>>> res = re.findall('\d','ds1hg2jh4jh5')
>>> len(res)
4
>>> res = re.findall('\d','23425')
>>> len(res)
5
So next you just need to use 'if' to check the number of digits.
There could be a couple reasons:
Since you want \d to search for digits or numbers, you should probably spell that as "\\d" or r"\d". "\d" might happen to work, but only because d isn't special (yet) in a string. "\n" or "\f" or "\r" will do something totally different. Check out the re module documentation and search for "raw strings".
"\\d{0,3}" will match just about anything, because {0,3} means "zero or up to three". So, it will match the start of any string, since any string starts with the empty string.
Or, perhaps you want to be searching for strings that are only zero to three digits, and nothing else. In this case, you want to use something like r"^\d{0,3}$". The reason is that regular expressions match anywhere in a string (or only at the beginning if you are using re.match and not re.search). ^ matches the start of the string, and $ matches the end, so by putting those at each end you are not matching anything that has anything before or after \d{0,3}.
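The effect of the anchors can be sketched by running the anchored and unanchored patterns over a few candidate strings. The candidate list is made up for illustration.

```python
import re

anchored = re.compile(r"^\d{0,3}$")
unanchored = re.compile(r"\d{0,3}")

# Illustrative inputs; not taken from the question.
candidates = ["", "12", "123", "1234", "12a"]

anchored_ok = [c for c in candidates if anchored.search(c)]
unanchored_ok = [c for c in candidates if unanchored.search(c)]

print(anchored_ok)
print(unanchored_ok)
```

Without anchors every candidate is accepted, because the pattern is satisfied by an empty match or by the first three digits alone.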
