Confusing with the usage of regex in Python - python

I'm confused with the following three patterns, would someone explain it in more detail?
## IPython with Python 2.7.3
In [62]: re.findall(r'[a-z]*',"f233op")
Out[62]: ['f', '', '', '', 'op', ''] ## why does the last '' come out?
In [63]: re.findall(r'([a-z])*',"f233op")
Out[63]: ['f', '', '', '', 'p', ''] ## why does the character 'o' get lost?
In [64]: re.findall(r'([a-z]*)',"f233op")
Out[64]: ['f', '', '', '', 'op', ''] ## what's the difference from line 63 above?

Example 1
re.findall(r'[a-z]*',"f233op")
This pattern matches zero or more lowercase letters. The ZERO-or-more part is key here: at any position where no letter follows, a match of nothing is just as valid as a match of f or op. The last empty string returned is the match starting at the end of the string (the position between p and $, the end of the string).
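As a rough illustration (not part of the original answer), re.finditer can be used instead of re.findall to show where each match starts and ends. The variable names here are only illustrative:

```python
import re

text = "f233op"
# Record where each match starts and ends, plus the matched text.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r'[a-z]*', text)]
for start, end, matched in spans:
    print(start, end, repr(matched))
```

The final entry starts and ends at index 6, the very end of the string, which is where the trailing '' comes from.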
Example 2
re.findall(r'([a-z])*',"f233op")
Now you are matching character groups, each consisting of a single lowercase letter. The o is no longer returned because this is a greedy search: the whole run op is consumed as one match, and only the last string captured by the group is returned. So if you changed the string to f233op12fre, the final e would be returned, but not the preceding f or r. Likewise, if you take out the p in your string, you will see that o is returned as a valid match.
Conversely, if you tried to make this regex non-greedy by adding a ? (e.g. ([a-z])*?), the returned matches would all be empty strings on the Python 2.7 used here, since a valid match of nothing takes precedence over a valid match of something. (Python 3.7 changed how empty matches are handled, so the result can differ there.)
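To see the greedy case concretely, here is a small sketch (my own illustration, using the longer string f233op12fre mentioned above) that pairs each non-empty overall match with what group 1 holds once the repetition has finished:

```python
import re

text = "f233op12fre"
# group(0) is the whole match; group(1) keeps only the last repetition.
pairs = [(m.group(0), m.group(1))
         for m in re.finditer(r'([a-z])*', text)
         if m.group(0)]
print(pairs)
```

re.findall reports only the second element of each pair, which is why the earlier letters of each run appear to be lost.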
Example 3
re.findall(r'([a-z]*)',"f233op")
Nothing is different in the matched characters, but now you are returning character groups instead of raw matches. The output of this regex query will be the same as your first example, but you'll notice that if you add an additional matching group, you will suddenly see the results of each match attempt grouped into tuples:
IN : re.findall(r'([a-z]*)([0-9]*)',"f233op")
OUT: [('f', '233'), ('op', ''), ('', '')]
Contrast this with the same pattern, minus the parentheses (groups), and you'll see why they are important:
IN : re.findall(r'[a-z]*[0-9]*',"f233op")
OUT: ['f233', 'op', '']
Also...
It can be useful to plug regex patterns like these into regex diagram generators like Regexplained to see how the pattern matching logic works. For example, as an explanation as to why your regex is always returning empty character string matches, take a look at the difference between the patterns [a-z]* and [a-z]+.
Don't forget to check the Python docs for the re library if you get stuck; they give a pretty stellar explanation of the standard regex syntax.
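The difference between [a-z]* and [a-z]+ mentioned above can also be checked directly. A minimal sketch, with illustrative variable names:

```python
import re

text = "f233op"
star_matches = re.findall(r'[a-z]*', text)  # * may match the empty string
plus_matches = re.findall(r'[a-z]+', text)  # + needs at least one letter
print(star_matches)
print(plus_matches)
```

With +, the empty strings disappear because a match must contain at least one letter.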

You get the final '' because [a-z]* is matching the empty string at the end.
The character 'o' is missing because you have told re.findall to match groups, and each group has a single character. Put another way, you’re doing the equivalent of
m = re.match(r'([a-z])*', 'op')
m.group(1)
which will return 'p', because that’s the last thing captured by the parens (capture group 1).
In the third example you are again matching groups, but this time each group can hold multiple characters.

Your surprising results are related to the Regular Expression Quantifier *.
Consider:
[a-z]*
Vs:
[a-z]+
Consider as another example that I think is more illustrative of what you are seeing:
>>> re.findall(r'[a-z]*', '123456789')
['', '', '', '', '', '', '', '', '', '']
There are no characters in the set [a-z] in the string 123456789. Yet, since * means 'zero or more', all character positions 'match' by not matching any characters at that position.
For example, assume you just wanted to test if there were any letters in a string, and you use a regex like so:
>>> re.search(r'[a-z]*', '1234')
<_sre.SRE_Match object at 0x1069b6988> # a 'match' is returned, but this is
# probably not what was intended
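If the goal really is to test whether a string contains any letters, one possible fix is to require at least one character with +. This is only a sketch; the helper name has_lowercase_letter is hypothetical and not from the original answer:

```python
import re

def has_lowercase_letter(s):
    # Hypothetical helper: + requires at least one letter,
    # so re.search returns None when there are no letters.
    return re.search(r'[a-z]+', s) is not None

# With *, a match object is returned even though it matched nothing.
star_result = re.search(r'[a-z]*', '1234')
print(repr(star_result.group()))
print(has_lowercase_letter('1234'), has_lowercase_letter('12a4'))
```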
Now consider:
>>> re.findall(r'[a-z]*', '123abc789')
['', '', '', 'abc', '', '', '', '']
Vs:
>>> re.findall(r'([a-z])*', '123abc789')
['', '', '', 'c', '', '', '', '']
The first pattern is [a-z]*. The part [a-z] is a character class that, unless modified, matches a single character in the set a-z. Adding the * quantifier makes it greedily match as many characters as possible, hence the match of 'abc', but it also allows zero characters to be a match, so a position followed by a character outside the set still matches with an empty string.
The addition of a capturing group in ([a-z])* means the group captures a single character per repetition, and only the last character captured is returned.
If you want the effect of grouping (say in a more complex pattern), use a non-capturing group:
>>> re.findall(r'(?:[a-z])*', '123abc789')
['', '', '', 'abc', '', '', '', '']

In line 63 you're finding all instances of a group (indicated with the parens) of characters of length 1. The * repeats that group, so a run of letters is consumed as one match but only the last character captured is reported; it also lets the pattern match zero-length strings.
In the other examples having the * next to the [a-z], you match adjacent characters of any length.
EDIT
Playing around with this tool may help.

Related

Extract salaries from a list of strings

I'm trying to extract salaries from a list of strings.
I'm using the regex findall() function but it's returning many empty strings as well as the salaries and this is causing me problems later in my code.
sal= '41 000€ à 63 000€ / an' #this is a sample string for which i have errors
regex = ' ?([0-9]* ?[0-9]?[0-9]?[0-9]?)'#this is my regex
re.findall(regex,sal)[0]
#returns '41 000' as expected but:
re.findall(regex,sal)[1]
#returns: ''
#Desired result : '63 000'
#the whole list of matches is like this:
['41 000',
'',
'',
'',
'',
'',
'',
'63 000',
'',
'',
'',
'',
'',
'',
'',
'',
'']
# I would prefer ['41 000','63 000']
Can anyone help?
Thanks
re.findall returns the capturing groups when your pattern contains them, and almost everything in your group is optional, which gives you the empty strings in the result.
In your pattern you use [0-9]*, which matches a digit zero or more times. If there is no limit on the number of leading digits, you might use [0-9]+ instead so that at least one digit is required.
You might use this pattern with a capturing group:
(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)
Explanation
(?<!\S) Assert what is on the left is not a non whitespace character
( Capture group
[0-9]+(?: [0-9]{1,3})? match 1+ digits followed by an optional part that matches a space and 1-3 digits
) Close capture group
€ Match literally
(?!\S) Assert what is on the right is not a non whitespace character
Your code might look like:
import re
sal= '41 000€ à 63 000€ / an' #this is a sample string for which i have errors
regex = r'(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)'
print(re.findall(regex,sal)) # ['41 000', '63 000']
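If the salaries are needed as numbers later in the code, a possible follow-up step is to strip the thousands separator and convert each match. This is a sketch that assumes the space is the only separator; the names pattern and amounts are illustrative:

```python
import re

sal = '41 000€ à 63 000€ / an'
pattern = re.compile(r'(?<!\S)([0-9]+(?: [0-9]{1,3})?)€(?!\S)')
# Assumption: a single space is the only thousands separator.
amounts = [int(s.replace(' ', '')) for s in pattern.findall(sal)]
print(amounts)
```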

Python RE re.split(), results start with empty string

I have some questions on the split() description/examples from the Python RE documents
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
In this example there is a capturing group, it matches at the start and end of the string, thus, the result starts and ends with an empty string. Outside of understanding that this happens, I would like to better understand the reasoning. The explanation for this is:
That way, separator components are always found at the same relative
indices within the result list.
Could someone expand on this? Relative to what?
My other query is related to this example:
re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
\w will match any character that can be used in any word in any language (Flag:unicode), or will be the equivalent of [a-zA-Z0-9_] (Flag:ASCII), \W is the inverse of this. Can someone talk about each of the matches in the example above, explain each (if possible) in terms of what is matched (\B, \U, ...).
Added 29/01/2019:
Part of what I am after wasn't stated very clearly (my bad). In terms of the second example, I am curious about the steps taken to come to the result (how the Python re module processed the example). After reading this post on Zero-Length Regex Matches things are clearer, but I would still be interested if anyone can break down the logic up to ['', '...', '', '', 'w', in the results.
What it's trying to say is that when you have a capturing group in the delimiter, and it matches the beginning of the string, the resulting list will always start with the delimiter. Similarly, if it matches at the end of the string, the list will always end with the delimiter.
For consistency, this is true even when the delimiter matches an empty string. The input string is considered to have an empty string before the first character and after the last character, and the delimiter will match these. And then they'll be the first and last elements of the resulting list.
Check this:
>>> re.split('(a)', 'awords')
['', 'a', 'words']
>>> re.split('(w)', 'awords')
['a', 'w', 'ords']
>>> re.split('(o)', 'awords')
['aw', 'o', 'rds']
>>> re.split('(s)', 'awords')
['aword', 's', '']
The delimiter is always in the second place (index 1).
On the other hand:
>>> re.split('a', 'awords')
['', 'words']
>>> re.split('w', 'awords')
['a', 'ords']
>>> re.split('s', 'awords')
['aword', '']
Almost the same, except that without the capturing group the delimiter itself is not included in the result.
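One way to read "the same relative indices" is that, with a single capturing group, the text between separators always sits at even indices and the separators at odd indices. A small sketch using the example from the documentation (the variable names are mine):

```python
import re

original = '...words, words...'
parts = re.split(r'(\W+)', original)
fields = parts[0::2]      # text between separators (even indices)
separators = parts[1::2]  # captured separators (odd indices)
print(fields)
print(separators)
# Because nothing is dropped, joining the parts rebuilds the input.
print(''.join(parts) == original)
```

The leading and trailing empty strings are what keep this even/odd layout intact when the string starts or ends with a separator.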

Regex python findall issue

From the test string:
test=text-AB123-12a
test=text-AB123a
I have to extract only 'AB123-12' and 'AB123', but:
re.findall("[A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)?", test)
returns:
['', '', '', '', '', '', '', 'AB123-12a', '']
What are all these extra empty spaces? How do I remove them?
The quantifier {0,n} will match anywhere from 0 to n occurrences of the preceding pattern. Since the first two parts of your pattern allow 0 occurrences, and the third is optional (?), the whole pattern can match a 0-length string, which it does at every position where nothing longer matches.
Changing the quantifiers to require a minimum of one, with maximums of 9 and 5 respectively, yields correct results:
>>> import re
>>> test='text-AB123-12a'
>>> re.findall(r"[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a']
Without further detail about what exactly the strings you are matching look like, I can't give a better answer.
Your pattern can match zero-length strings because the lower limits of your quantifiers are set to 0. Simply setting them to 1 will produce the results you want:
>>> import re
>>> test = ''' test=text-AB123-12a
... test=text-AB123a'''
>>> re.findall(r"[A-Z]{1,9}\d{1,5}(?:-\d{0,2}a)?", test)
['AB123-12a', 'AB123']
RegEx tester: http://www.regexpal.com/ says that your pattern string [A-Z]{0,9}\d{0,5}(?:-\d{0,2}a)? can match 0 characters, and therefore matches infinitely.
Check your expression one more time. Python does not loop forever here; it returns an empty string for each position where the pattern matches nothing.
Since all parts of your pattern are optional (your ranges specify zero to N occurrences and you are qualifying the group with ?), each position in the string counts as a match and most of those are empty matches.
How to prevent this from happening depends on the exact format of what you are trying to match. Are all those parts of your match really optional?
Since letters or digits are optional at the beginning, you must ensure that there's at least one letter or one digit, otherwise your pattern will match the empty string at each position in the string. You can do it starting your pattern with a lookahead. Example:
re.findall(r'(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)', test)
In this way the match can start with a letter or with a digit.
I assume that when there's a hyphen, it is followed by at least one digit (otherwise what is the reason for this hyphen?). In other words, I assume that -a isn't possible at the end. (Correct me if I'm wrong.)
To exclude the "a" from the match result, I put it in a lookahead.
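For reference, here is how the lookahead version behaves on the two sample lines. This sketch assumes the test string is the two lines from the question joined by a newline:

```python
import re

# Assumption: the question's two sample lines, joined by a newline.
test = "test=text-AB123-12a\ntest=text-AB123a"
pattern = r'(?=[A-Z0-9])[A-Z]{0,9}\d{0,5}(?:-\d\d?)?(?=a)'
matches = re.findall(pattern, test)
print(matches)
```

The leading lookahead stops the pattern from matching at positions where neither an uppercase letter nor a digit follows, so no empty strings are returned.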

Regex one-liner for matching only what comes after a certain word?

I want to extract song names from a list like this: 'some text here, songs: song1, song2, song3, fro: othenkl' and get ['song1', 'song2', 'song3']. So I try to do it in one regex:
result = re.findall('[Ss]ongs?:?.*', 'songs: songname1, songname2,')
print(re.findall(r'(?:(\w+),)*', result[0]))
This matches perfectly: ['', '', '', '', '', '', '', 'songname1', '', 'songname2', ''] (except for the empty strings, but nbd).
But I want to do it in one line, so I do the following:
print(re.findall(r'[Ss]ongs?:?(?:(\w+),)*','songs: songname1, songname2,'))
But I do not understand why this is unable to capture the same as the two regexes above:
['', 'name1', 'name2']
Is there a way to accomplish this in one line? It would be useful to be concise here. thanks.
You don't need re.findall in this case; it is better to use re.search to find the sequence of songs and then split the result on the comma. You also don't need the character class [Ss] to match the capital letter; you can use the ignore-case flag (re.I):
>>> s ='some text here, songs: song1, song2, song3, fro: othenkl'
>>> re.search(r'(?<=songs:)(.+),', s,flags=re.I).group(1).split(',')
[' song1', ' song2', ' song3']
(?<=songs:) is a positive lookbehind, which makes the regex engine match only text preceded by songs:, and (.+), matches the longest string after songs: that is followed by a comma, which is the sequence of your songs.
As a more general approach, instead of requiring a comma at the end of your regex, you can rely on the fact that the song names are followed by the pattern \s\w+:.
>>> re.search(r'(?<=songs:)(.+)(?=\s\w+:)', s).group(1).split(',')
[' song1', ' song2', ' song3', '']
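Both variants above leave a leading space on each name (and the second one a trailing empty string). A possible cleanup, sketched here with illustrative names and without the lookbehind, is to strip each piece and drop the empty ones:

```python
import re

s = 'some text here, songs: song1, song2, song3, fro: othenkl'
m = re.search(r'songs:(.+),', s, flags=re.I)
# Strip whitespace from each name and skip empty pieces.
songs = [name.strip() for name in m.group(1).split(',') if name.strip()] if m else []
print(songs)
```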
No, you can't do it in one pattern with the re module.
What you can do is to use the regex module instead with this pattern:
regex.findall(r'(?:\G(?!\A), |\msongs: )(\w++)(?!:)', s)
Where \G is the position after the previous match, \A the start of the string, \m a word boundary followed by word characters, and ++ a possessive quantifier.

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I've tried various combinations of regular expressions, but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I've tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alphanumerics greedily, followed by either one or no white-space character, and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation (python help('re')) says that *, +, ? match x or x (greedy) repetitions of the preceding RE.
In such a case, is the space alone the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the RE for the '*' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc., which are basically variations of the same idea: that the sentence is a set of alphanumerics followed by a single/finite number of white spaces, and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.
Your reasoning about the regex is correct; your problem comes from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundaries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use m.group(0) on the resulting match object to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
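To connect this with the findall output in the question, the following sketch shows what the individual groups hold after the match, and how a non-capturing group makes findall return the whole match instead. The variable names are illustrative:

```python
import re

s = "This is a regular sentence."
m = re.match(r"((\w+)(\s?))*", s)
whole = m.group(0)   # everything the pattern consumed
groups = m.groups()  # each group keeps only its last repetition
# With a non-capturing group, findall returns whole matches.
non_capturing = re.findall(r"(?:\w+\s?)+", s)
print(whole)
print(groups)
print(non_capturing)
```

The tuple in groups is the same as the first tuple re.findall returned in the question, because findall reports the groups rather than the whole match whenever the pattern contains capturing groups.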
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of white-space characters in a row? A sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row; what defines a sentence is rather that it ends with a punctuation mark, or with something else that is not in the above character set.
([a-zA-Z0-9\s])*
The above regex matches a series of alphanumeric or white-space characters, zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with an alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""
# A sentence ends at a period, a newline, or a run of two or more spaces.
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')
def main():
    # Print each sentence found, numbered from 0.
    for i, s in enumerate(re_sentence.finditer(source)):
        print("%d:%s" % (i, s.group(0)))
if __name__ == '__main__':
    main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.
