Exclude matched string python re.findall - python

I am using Python's re.findall method to find occurrences of certain string values in an input string.
For example, when searching the string 'ABCdef', I have two requirements:
Find a string that starts with a single capital letter.
After step 1, find strings that consist entirely of capital letters.
For example, the input strings and expected outputs are:
'USA' -- output: ['USA']
'BObama' -- output: ['B', 'Obama']
'Institute20CSE' -- output: ['Institute', '20', 'CSE']
So my expectation is that
>>> matched_value_list = re.findall ( '[A-Z][a-z]+|[A-Z]+' , 'ABCdef' )
returns ['AB', 'Cdef'].
But that does not seem to happen. What I get is ['ABC'], where the second alternative of the regex matches the whole run of capitals.
Is there any way to exclude matches that were already found, so that once 'Cdef' is matched by '[A-Z][a-z]+', the second part of the regex ('[A-Z]+') only matches the remaining string 'AB'?

First you need to match AB: a run of uppercase letters that is either followed by an uppercase letter and then a lowercase letter, or is at the end of the string. For that you can use a look-ahead.
Then you need to match an uppercase letter (C) followed by one or more lowercase letters (def).
So you can use this pattern:
>>> s = "ABCdef"
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", s)
['AB', 'Cdef']
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", 'MumABXYZCdefXYZAbc')
['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']
As pointed out in a comment by sotapme, you can also modify the above regex to:
"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)"
I added \d+ since you also want to match digits, as in one of your examples. The comment also removed the [a-z] part from the look-ahead. That works because the + quantifier on the outer [A-Z] is greedy by default: it first takes the longest run of uppercase letters, and when the look-ahead fails at the end of that run (for example because a lowercase letter follows), it backs off and stops just before the last uppercase letter.
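As a quick check, the modified pattern can be run against the examples from the question. This is only a sketch: the helper name `split_caps` is hypothetical, and the outer capturing group is dropped here so that findall returns whole matches.

```python
import re

# Modified pattern from the answer, as a raw string and without the outer group.
PATTERN = re.compile(r"[A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+")

def split_caps(text):
    """Return runs of capitals, capitalized words and digit runs (illustrative helper)."""
    return PATTERN.findall(text)

for sample in ("USA", "BObama", "Institute20CSE", "ABCdef"):
    print(sample, "->", split_caps(sample))
# USA -> ['USA']
# BObama -> ['B', 'Obama']
# Institute20CSE -> ['Institute', '20', 'CSE']
# ABCdef -> ['AB', 'Cdef']
```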

You can use this regex
[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)
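A small sketch of how this lazy pattern behaves on the question's examples. The variable name `LAZY` is made up for illustration. Note that the pattern has no alternative for digits, so the '20' in the third example is not returned.

```python
import re

# Lazy match: extend only until a capitalized word, a non-letter, or the end follows.
LAZY = re.compile(r"[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)")

print(LAZY.findall("ABCdef"))          # ['AB', 'Cdef']
print(LAZY.findall("BObama"))          # ['B', 'Obama']
print(LAZY.findall("Institute20CSE"))  # ['Institute', 'CSE'] (digits are skipped)
```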

Related

Regex python findall issue

From the test string:
test=text-AB123-12a
test=text-AB123a
I have to extract only 'AB123-12' and 'AB123', but:
re.findall("[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?", test)
returns:
['', '', '', '', '', '', '', 'AB123-12a', '']
What are all these extra empty strings? How do I remove them?
The quantifier {0,n} matches anywhere from 0 to n occurrences of the preceding pattern. Since the first two parts of your pattern allow 0 occurrences and the third is optional (?), the whole pattern can match a zero-length string, so it produces an empty match at every position where nothing longer fits.
Requiring a minimum of one occurrence (with maximums of 9 and 5 respectively) for the first two parts yields the correct result:
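To see the zero-length matches directly, here is a minimal sketch that runs the question's pattern and then filters out the empty strings. Filtering is shown only as an illustration of where the empties come from, not as the recommended fix; the variable names are made up.

```python
import re

test = "text-AB123-12a"

# Every part of this pattern may match nothing, so findall also reports
# empty matches at positions where no longer match is possible.
raw = re.findall(r"[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?", test)
non_empty = [m for m in raw if m]

print(raw.count(""), "empty matches")
print(non_empty)  # ['AB123-12a']
```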
>>> test='text-AB123-12a'
>>> import re
>>> re.findall(r"[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a']
Without further detail about what exactly the strings you are matching look like, I can't give a better answer.
Your pattern can match zero-length strings because the lower limits of your character-class quantifiers are set to 0. Simply setting them to 1 produces the results you want:
>>> import re
>>> test = ''' test=text-AB123-12a
... test=text-AB123a'''
>>> re.findall(r"[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a', 'AB123']
A regex tester (http://www.regexpal.com/) says that your pattern string [A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)? can match 0 characters, and therefore matches at every position.
Check your expression one more time. In Python, those zero-length matches are the empty strings in your result.
Since all parts of your pattern are optional (your ranges specify zero to N occurrences and you are qualifying the group with ?), each position in the string counts as a match, and most of those are empty matches.
How to prevent this from happening depends on the exact format of what you are trying to match. Are all those parts of your match really optional?
Since letters or digits are optional at the beginning, you must ensure that there's at least one letter or one digit; otherwise your pattern will match the empty string at each position in the string. You can do that by starting your pattern with a lookahead. Example:
re.findall(r'(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)', test)
In this way the match can start with a letter or with a digit.
I assume that when there's a hyphen, it is followed by at least one digit (otherwise, what is the reason for the hyphen?). In other words, I assume that -a isn't possible at the end. (Correct me if I'm wrong.)
To exclude the "a" from the match result, I put it in a lookahead.
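A short sketch of the lookahead version applied to both test lines from the question. Joining the two lines with a newline is an assumption about how the test data is stored.

```python
import re

test = "test=text-AB123-12a\ntest=text-AB123a"

# (?=[A-Z0-9]) forces a non-empty start; (?=a) requires the trailing "a"
# without including it in the match.
pattern = re.compile(r"(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)")

print(pattern.findall(test))  # ['AB123-12', 'AB123']
```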

How to change a quantifier in a Regex based on a condition?

I would like to find words of length >= 1 which may contain a ' or a - within. Here is a test string:
a quake-prone area- (aujourd'hui-
In Python, I'm currently using this regex:
string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)
I would like to get this result:
>>> words
[u'a', u'quake-prone', u'area', u"aujourd'hui"]
but I get this instead:
>>> words
[u'quake-prone', u'area', u"aujourd'hui"]
Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier, it will find a but also area- instead of area.
So how could I create a conditional regex that says: if the word contains an apostrophe or a hyphen, use the + quantifier, otherwise use the * quantifier?
I suggest making the last [-\']?[a-z]+ part optional by putting it into a group and then adding a ? quantifier after that group.
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]
The reason a is not matched is that your regex contains two [a-z]+ parts, which together require at least two lowercase letters in every match.
Note that the regex I suggested won't match area-, because the optional group (?:[-\'][a-z]+)? requires at least one lowercase letter right after the - or ' symbol. If there is none, the group matches nothing and the match ends before the hyphen. That is why you get area rather than area- in the output: no lowercase letter follows the -.
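The three variants discussed in this thread can be compared side by side. This is a sketch; the variable names are made up for illustration.

```python
import re

text = "a quake-prone area- (aujourd'hui-"

original = re.compile(r"[a-z]+[-']?[a-z]+")     # needs at least two letters
star = re.compile(r"[a-z]+[-']?[a-z]*")         # lets a trailing - or ' through
grouped = re.compile(r"[a-z]+(?:[-'][a-z]+)?")  # separator only if letters follow

print(original.findall(text))  # ['quake-prone', 'area', "aujourd'hui"]
print(star.findall(text))      # ['a', 'quake-prone', 'area-', "aujourd'hui"]
print(grouped.findall(text))   # ['a', 'quake-prone', 'area', "aujourd'hui"]
```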

Python Alphanumeric Regex

Below I have the following regex:
alphanumeric = compile('^[\w\d ]+$')
I'm running the current data against this regex:
Tomkiewicz Zigomalas Andrade Mcwalters
I have a separate regex to identify alpha characters only, yet the data above still matches the alphanumeric criteria.
Edit: How do I stop alpha-only data from matching the regex above?
Description: an acceptable string can take two forms:
It starts with digits, followed by at least one letter, followed by any number of alphanumeric characters.
It starts with letters, followed by at least one digit, followed by any number of alphanumeric characters.
Demo:
>>> an_re = r"(\d+[A-Z])|([A-Z]+\d)[\dA-Z]*"
>>> re.search(an_re, '12345', re.I) # not acceptable string
>>> re.search(an_re, 'abcd', re.I) # not acceptable string
>>> re.search(an_re, 'abc1', re.I) # acceptable string
<_sre.SRE_Match object at 0x14153e8>
>>> re.search(an_re, '1abc', re.I)
<_sre.SRE_Match object at 0x14153e8>
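One thing to watch in the demo: re.search only needs a matching substring, and the alternation binds more loosely than the trailing [\dA-Z]*, so for '1abc' the match itself is just '1a'. Below is a hedged sketch of a whole-string check. The added grouping, the use of re.fullmatch (Python 3.4+), and the helper name `is_mixed_alnum` are assumptions and not part of the answer above.

```python
import re

# Both alternatives are grouped so the trailing [\dA-Z]* applies to either form.
AN_RE = re.compile(r"(?:\d+[A-Z]|[A-Z]+\d)[\dA-Z]*", re.I)

def is_mixed_alnum(value):
    """True when the whole value is letters and digits and contains both (illustrative helper)."""
    return AN_RE.fullmatch(value) is not None

for candidate in ("12345", "abcd", "abc1", "1abc", "abc1!"):
    print(candidate, is_mixed_alnum(candidate))
# 12345 False
# abcd False
# abc1 True
# 1abc True
# abc1! False
```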
Use a lookahead to assert the condition that at least one alpha and at least one digit are present:
(?=.*[a-zA-Z])(?=.*[0-9])^[\w\d ]+$
The above RegEx utilizes two lookaheads to first check the entire string for each condition. The lookaheads search up until a single character in the specified range is found. If the assertion matches then it moves on to the next one. The last part I borrowed from the OP's original attempt and just ensures that the entire string is composed of one or more lower/upper alphas, underscores, digits, or spaces.
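A minimal sketch of the two-lookahead pattern run against the question's data. The helper name `has_letters_and_digits` and the extra sample strings are made up for illustration.

```python
import re

# Two lookaheads demand at least one letter and one digit somewhere in the
# string; the anchored character class then restricts the allowed characters.
ALNUM_MIXED = re.compile(r"(?=.*[a-zA-Z])(?=.*[0-9])^[\w\d ]+$")

def has_letters_and_digits(value):
    """True when value passes both lookaheads and the character-class check."""
    return ALNUM_MIXED.search(value) is not None

print(has_letters_and_digits("Tomkiewicz Zigomalas Andrade Mcwalters"))  # False
print(has_letters_and_digits("12345"))                                   # False
print(has_letters_and_digits("Tomkiewicz 42"))                           # True
```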

Finding all possible substrings within a string. Python Regex

I want to find all possible substrings inside a string with the following requirement: The substring starts with N, the next letter is anything but P, and the next letter is S or T
With the test string "NNSTL", I would like to get as results "NNS" and "NST"
Is this possible with Regex?
Try the following regex:
N[^P\W\d_][ST]
The first character is N. The next character is anything except (^) P, a non-word character (\W), a digit (\d) or an underscore (_), which leaves letters other than P. The last character is either S or T. I'm assuming the second character must be a letter.
EDIT
The above regex will only match the first instance in the string "NNSTL" because it will then start the next potential match at position 3: "TL". If you truly want both results at the same time use the following:
(?=(N[^P\W\d_][ST])).
The substring will be in group 1 instead of the whole pattern match which will only be the first character.
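The difference between the two patterns can be seen on the test string from the question. This is a short sketch with made-up variable names.

```python
import re

s = "NNSTL"

# Plain pattern: matches consume characters, so overlapping hits are lost.
plain = re.findall(r"N[^P\W\d_][ST]", s)

# Lookahead with a capture group: only one character is consumed per match,
# and findall returns the contents of group 1.
overlapping = re.findall(r"(?=(N[^P\W\d_][ST])).", s)

print(plain)        # ['NNS']
print(overlapping)  # ['NNS', 'NST']
```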
You can do this with the re module:
import re
Here's a possible search string:
my_txt = 'NfT foo NxS bar baz NPT'
So we use the regular expression that first looks for an N, any character other than a P, and a character that is either an S or a T.
regex = 'N[^P][ST]'
and using re.findall:
found = re.findall(regex, my_txt)
and found returns:
['NfT', 'NxS']
Yes. The regex snippet is: "N[^P][ST]"
Plug it in to any regex module methods from here: http://docs.python.org/2/library/re.html
Explanation:
N matches a literal "N".
[^P] is a set, where the caret ("^") denotes inverse (so it matches anything not in the set).
[ST] is another set, where it matches either an "S" or a "T".

How should I write this regex in python

I have the string.
st = "12345 hai how r u #3456? Awer12345 7890"
re.findall('([0-9]+)',st)
It should not return:
['12345', '3456', '12345', '7890']
I should get:
['12345','7890']
I should only take the numeric values, and they should not contain any other characters such as letters or special characters.
No need to use a regular expression:
[i for i in st.split(" ") if i.isdigit()]
Which I think is much more readable than using a regex
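For completeness, a runnable version of the same idea. Using split() with no argument instead of split(" ") is a small assumption on my part: it also copes with tabs, newlines and repeated spaces.

```python
st = "12345 hai how r u #3456? Awer12345 7890"

# Keep only whitespace-separated tokens made up entirely of digits.
numbers = [token for token in st.split() if token.isdigit()]

print(numbers)  # ['12345', '7890']
```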
Corey's solution is really the right way to go here, but since the question did ask for regex, here is a regex solution that I think is simpler than the others:
re.findall(r'(?<!\S)\d+(?!\S)', st)
And an explanation:
(?<!\S) # Fail if the previous character (if one exists) isn't whitespace
\d+ # Match one or more digits
(?!\S) # Fail if the next character (if one exists) isn't whitespace
Some examples:
>>> re.findall(r'(?<!\S)\d+(?!\S)', '12345 hai how r u #3456? Awer12345 7890')
['12345', '7890']
>>> re.findall(r'(?<!\S)\d+(?!\S)', '12345 hai how r u #3456? Awer12345 7890123ER%345 234 456 789')
['12345', '234', '456', '789']
In [21]: re.findall(r'(?:^|\s)(\d+)(?=$|\s)', st)
Out[21]: ['12345', '7890']
Here,
(?:^|\s) is a non-capture group that matches the start of the string, or a space.
(\d+) is a capture group that matches one or more digits.
(?=$|\s) is lookahead assertion that matches the end of the string, or a space, without consuming it.
Use this pattern: (^|\s)[0-9]+(\s|$). Here (^|\s) means that your number must be at the start of the string or be preceded by a whitespace character, and (\s|$) means that there must be whitespace after the number or the number must be at the end of the string.
As Jan Pöschko said, 456 won't be found in 123 456. If your "bad" parts (#, Awer) are always prefixes, you can use the pattern (^|\s)[0-9]+ and everything will be OK. It will match all numbers that are preceded by whitespace or are at the start of the string. Hope this helps.
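A sketch of why 456 is missed in '123 456', and how lookarounds avoid it. The variant with non-capturing groups is my own rewrite for use with findall, which otherwise returns the contents of the two capturing groups rather than the number.

```python
import re

s = "123 456"

# With capturing groups, findall returns (group 1, group 2) tuples.
tuples = re.findall(r"(^|\s)[0-9]+(\s|$)", s)

# The space after 123 is consumed by the first match, so 456 has no
# leading whitespace left to match against.
consuming = re.findall(r"(?:^|\s)([0-9]+)(?:\s|$)", s)

# Lookarounds check the neighbours without consuming them.
lookaround = re.findall(r"(?<!\S)[0-9]+(?!\S)", s)

print(tuples)      # [('', ' ')]
print(consuming)   # ['123']
print(lookaround)  # ['123', '456']
```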
Your expression finds all sequences of digits, regardless of what surrounds them. You need to include a specification of what comes before and after the sequence to get the behavior you want:
re.findall(r"[\D\b](\d+)[\D\b]", st)
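In English, this is meant to say "match all sequences of one or more digits that are surrounded by a non-digit character or a word boundary". Note, however, that inside a character class \b means a backspace character rather than a word boundary, so the pattern as written requires a non-digit character on both sides of the digits. For st it returns ['3456', '12345'], which is not the desired result.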
