Regex expression for a given string - python

I have a small issue i am running into. I need a regular expression that would split a passed string with numbers separately and anything chunk of characters within square brackets separately and regular set of string separately.
for example if I have a strings that resembles
s = 2[abc]3[cd]ef
i need a list with lst = ['2','abc','3','cd','ef']
I have a code so far that has this..
import re
s = "2[abc]3[cd]ef"
s_final = ""
res = re.findall("(\d+)\[([^[\]]*)\]", s)
print(res)
This is outputting a list of tuples that looks like this.
[('2', 'abc'), ('3', 'cd')]
I am very new to regular expression and learning.. Sorry if this is an easy one.
Thanks!

The immediate fix is getting rid of the capturing groups and using alternation to match either digits or chars other than square bracket chars:
import re
s = "2[abc]3[cd]ef"
res = re.findall(r"\d+|[^][]+", s)
print(res)
# => ['2', 'abc', '3', 'cd', 'ef']
See the regex demo and the Python demo. Details:
\d+ - one or more digits
| - or
[^][]+ - one or more chars other than [ and ]
Other solutions that might help are:
re.findall(r'\w+', s)
re.findall(r'\d+|[^\W\d_]+', s)
where \w+ matches one or more letters, digits, underscores and some more connector punctuation with diacritics and [^\W\d_]+ matches any one or more Unicode letters.
See this Python demo.

Don't try a regex that will find all part in the string, but rather a regex that is able to match each block, and \w (meaning [a-zA-Z0-9_]) feats well
s = "2[abc]3[cd]ef"
print(re.findall(r"\w+", s)) # ['2', 'abc', '3', 'cd', 'ef']
Or split on brackets
print(re.split(r"[\[\]]", s)) # ['2', 'abc', '3', 'cd', 'ef ']

Regex is intended to be used as a Regular Expression, your string is Irregular.
regex is being mostly used to find a specific pattern in a long text, text validation, extract things from text.
for example, in order to find a phone number in a string, I would use RegEx, but when I want to build a calculator and I need to extract operators/digits I would not, but I would rather want to write a python code to do that.

Related

Python regex, capture groups that are not specific

Consider the following example strings:
abc1235abc53abcXX
123abc098YXabc
I want to capture the groups that occur between the abc,
e.g. I should get the following groups:
1235, 53, XX
123, 098YX
I'm trying this regex, but somehow it does not capture the in-between text:
(abc(.*?))+
What am I doing wrong?
EDIT: I need to do it using regex, no string splitting, since I need to apply further rules on the captured groups.
re.findall() approach with specific regex pattern:
import re
strings = ['abc1235abc53abcXX', '123abc098YXabc']
pat = re.compile(r'(?:abc|^)(.+?)(?=abc|$)') # prepared pattern
for s in strings:
items = pat.findall(s)
print(items)
# further processing
The output:
['1235', '53', 'XX']
['123', '098YX']
(?:abc|^) - non-captured group to match either abc substring OR start of the string ^
(.+?) - captured group to match any character sequence as few times as possible
(?=abc|$) - lookahead positive assertion, ensures that the previous matched item is followed by either abc sequence OR end of the string $
Use re.split:
import re
s = 'abc1235abc53abcXX'
re.split('abc', s)
# ['', '1235', '53', 'XX']
Note that you get an empty string, representing the match before the first 'abc'.
Try splitting the string by abc and then remove the empty results by using if statement inside list comprehension as below:
[r for r in re.split('abc', s) if r]

Split a line with matching regex special characters

Currently I want to split a line with all the matching special characters of the regex. As it is hard to explain, here are a few examples:
('.+abcd[0-9]+\.mp3', 'Aabcd09.mp3') -> [ 'A', '09' ]
.+ is a special expression of the regex and this is the match that I want
[0-9]+ is another regex expression and I want what it matches too
('.+\..+_[0-9]+\.mp3', 'A.abcd_09.mp3') -> [ 'A', 'abcd', '09' ]
.+ is the first special expression of the regex, it matches A
.+ is the second special expression of the regex, it matches abcd
[0-9]+ is the third special expression of the regex, it matches 09
Do you know how to achieve this? I didn't find anything.
Looks like you need a so called tokenizer/lexer to parse a regular expression first. It will allow you to split a base regex on sub-expressions. Then just apply these sub-expressions to the original string and print out matches.
You can try this:
import re
s = ['Aabcd09.mp3', 'A.abcd_09.mp3']
new_s = [re.findall('(?<=^)[a-zA-Z]|(?<=\.)[a-zA-Z]+(?=_)|\d+(?=\.mp3)', i) for i in s]
Output:
[['A', '09'], ['A', 'abcd', '09']]

Split string by position not character

We know that anchors, word boundaries, and lookaround match at a position, rather than matching a character.
Is it possible to split a string by one of the preceding ways with regex (specifically in python)?
For example consider the following string:
"ThisisAtestForchEck,Match IngwithPosition."
So i want the following result (the sub-strings that start with uppercase letter but not precede by space ):
['Thisis', 'Atest', 'Forch' ,'Eck,' ,'Match Ingwith' ,'Position.']
If i split with grouping i get:
>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atchingwith', 'P', 'osition.']
And this is the result with look-around :
>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,MatchingwithPosition.']
Note that if i want to split by sub-strings that start with uppercase and are preceded by a space, e.g.:
['Thisis', 'Atest', 'Forch' ,'Eck,' ,'Match ', Ingwith' ,'Position.']
I can use re.findall, viz.:
>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
But what about the first example: is it possible to solve it with re.findall?
A way with re.findall:
re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)
When you decide to change your approach from split to findall, the first job consists to reformulate your requirements: "I want to split the string on each uppercase letter non preceded by a space" => "I want to find one or more substrings separed by space that begins with an uppercase letter except from the start of the string (if the string doesn't start with an uppercase letter)"
(?<!\s)(?=[A-Z])
You can use this to split with regex module as re does not support split at 0 width assertions.
import regex
x="ThisisAtestForchEck,Match IngwithPosition."
print regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1)
or
print [i for i in regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1) if i]
See demo.
https://regex101.com/r/sJ9gM7/65
I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:
re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]
This can be used in the following list comprehension to give the desired output:
[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
And here is a hack that uses split:
re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
try capture using this pattern
([A-Z][a-z]*(?: [A-Z][a-z]*)*)
Demo

re.findall not returning full match?

I have a file that includes a bunch of strings like "size=XXX;". I am trying Python's re module for the first time and am a bit mystified by the following behavior: if I use a pipe for 'or' in a regular expression, I only see that bit of the match returned. E.g.:
>>> myfile = open('testfile.txt', 'r').read()
>>> re.findall('size=50;', myfile)
['size=50;', 'size=50;', 'size=50;', 'size=50;']
>>> re.findall('size=51;', myfile)
['size=51;', 'size=51;', 'size=51;']
>>> re.findall('size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
>>> re.findall(r'size=(50|51);', myfile)
['51', '51', '51', '50', '50', '50', '50']
The "size=" part of the match is gone (Yet it is certainly used in the search, otherwise there would be more results). What am I doing wrong?
The problem you have is that if the regex that re.findall tries to match captures groups (i.e. the portions of the regex that are enclosed in parentheses), then it is the groups that are returned, rather than the matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex that re.findall tries to match does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
When a regular expression contains parentheses, they capture their contents to groups, changing the behaviour of findall() to only return those groups. Here's the relevant section from the docs:
(...)
Matches whatever regular expression is inside the parentheses,
and indicates the start and end of a group; the contents of a group
can be retrieved after a match has been performed, and can be matched
later in the string with the \number special sequence, described
below. To match the literals '(' or ')', use \( or \), or enclose them
inside a character class: [(] [)].
To avoid this behaviour, you can use a non-capturing group:
>>> re.findall(r'size=(?:50|51);',myfile)
['size=51;', 'size=51;', 'size=51;', 'size=50;', 'size=50;', 'size=50;', 'size=50;']
Again, from the docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
In some cases, the non-capturing group is not appropriate, for example with regex which detects repeated words (example from python docs)
r'(\b\w+)\s+\1'
In this situation to get whole match one can use
[groups[0] for groups in re.findall(r'((\b\w+)\s+\2)', text)]
Note that \1 has changed to \2.
As others mentioned, the "problem" with re.findall is that it returns a list of strings/tuples-of-strings depending on the use of capture groups. If you don't want to change the capture groups you're using (not to use character groups [] or non-capturing groups (?:)), you can use finditer instead of findall. This gives an iterator of Match objects, instead of just strings. So now you can fetch the full match, even when using capture groups:
import re
s = 'size=50;size=51;'
for m in re.finditer('size=(50|51);', s):
print(m.group())
Will give:
size=50;
size=51;
And if you need a list, similar to findall, you can use a list-comprehension:
>>> [m.group() for m in re.finditer('size=(50|51);', s)]
['size=50;', 'size=51;']
'size=(50|51);' means you are looking for size=50 or size=51 but only matching the 50 or 51 part (note the parentheses), therefore it does not return the sign=.
If you want the sign= returned, you can do:
re.findall('(size=50|size=51);',myfile)
I think what you want is using [] instead of (). [] indicates a set of characters while () indicates a group match. Try something like this:
re.findall('size=5[01];', myfile)

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I'm tried out of various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I'm tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alpha alphanumerics greedily followed by either one or no white-space character and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong but I would like to know why. (I expected this to return the entire sentence as the result)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation(python help('re')) says that the ,+,? Match x or x (greedy) repetitions of the preceding RE.
In such a case is simply space the preceding RE for '?' or is '\w+ ' the preceding RE? And what will be the RE for the '' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.
Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use re.group(0) to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of white space character in continuation? Because a sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row, but rather a sentence is the area of text that ends with a punctuation mark or rather something that is not in the above sequence including white space.
([a-zA-Z0-9\s])*
The above regex will match a sentence wherein it is a series or spaces in series zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with a alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| +)')
def main():
i = 0
for s in re_sentence.finditer(source):
print "%d:%s" % (i, s.group(0))
i += 1
if __name__ == '__main__':
main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.

Categories

Resources