How to select words with apostrophe using regular expression - python

I am trying to separate a string into a list, but I need to have the string contain words that are joined by apostrophes. For example :
String="My name is Melvin_JESUS, Guatemala, Dean'Olvier, 501soy...# 1231 !"
should give me a result as:
['my', 'name', 'is', 'melvin', 'jesus', 'guatemala', '"dean'oliver"', 'soy']
i have tried the following regular expression:
my_patern= r"(?:^|(?<=\s)|-)[A-Za-z'\.]+(?=\s|\t|$|\b)"
but doesn't give me the desired results.

You may use
(?<![^\W\d_])[^\W\d_]+(?:['.][^\W\d_]+)*(?![^\W\d_])
See the regex demo
Details
(?<![^\W\d_]) - no letter right before the match is allowed
[^\W\d_]+ - 1 or more letters
(?:['.][^\W\d_]+)* - 0 or more sequences of ' or . and then 1+ letters
(?![^\W\d_]) - no letter right after the match is allowed.
In Python, use
re.findall(r'(?<![^\W\d_])[^\W\d_]+(?:['.][^\W\d_]+)*(?![^\W\d_])', text)

Related

python re split at all space and punctuation except for the apostrophe

i want to split a string by all spaces and punctuation except for the apostrophe sign. Preferably a single quote should still be used as a delimiter except for when it is an apostrophe. I also want to keep the delimeters.
example string
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far splitted = re.split(r"[^'-\w]",words.lower())
I tried throwing the single quote after the ^ character but it is not working.
My desired output is this. splitted = [hello,my,name,is,joe,.,what's,your's]
It might be simpler to simply process your list after splitting without accounting for them at first:
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower()) # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe.'", "what's", "your's"]
>>> [word.strip("'") for word in split_words]
['hello', 'my', 'name', 'is', 'joe.', "what's", "your's"]
One option is to make use of lookarounds to split at the desired positions, and use a capture group what you want to keep in the split.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
See a regex demo and a Python demo.
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in the parenthesis is a group that matches either an apostrophe surrounded by letters or a single letter.
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point though, in practice you should probably use Woodford's answer.

Trying to repeat the regex breaks the regex

I have a working regex that matches ONE of the following lines:
A punctuation from the following list [.,!?;]
A word that is preceded by the beginning of the string or a space.
Here's the regex in question ([.,!?;] *|(?<= |\A)[\-'’:\w]+)
What I need it to do however is for it to match 3 instances of this. So, for example, the ideal end result would be something like this.
Sample text: "This is a test. Test"
Output
"This" "is" "a"
"is" "a" "test"
"a" "test" "."
"test" "." "Test"
I've tried simply adding {3} to the end in the hopes of it matching 3 times. This however results in it matching nothing at all or the occasional odd character. The other possibility I've tried is just repeating the whole regex 3 times like so ([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+) which is horrible to look at but I hoped it would work. This had the odd effect of working, but only if at least one of the matches was one of the previously listed punctuation.
Any insights would be appreciated.
I'm using the new regex module found here so that I can have overlapping searches.
What is wrong with your approach
The ([.,!?;] *|(?<= |\A)[\-'’:\w]+) pattern matches a single "unit" (either a word or a single punctuation from the specified set [.,!?;] followed with 0+ spaces. Thus, when you fed this pattern to the regex.findall, it only could return just the chunk list ['This', 'is', 'a', 'test', '. ', 'Test'].
Solution
You can use a slightly different approach: match all words, and all chunks that are not words. Here is a demo (note that C'est and AUX-USB are treated as single "words"):
>>> pat = r"((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*))\s*((?1))\s*((?1))"
>>> results = regex.findall(pat, text, overlapped = True)
>>> results
[("C'est", 'un', 'test'), ('un', 'test', '....'), ('test', '....', 'aux-usb')]
Here, the pattern has 3 capture groups, and the second and third one contain the same pattern as in Group 1 ((?1) is a subroutine call used in order to avoid repeating the same pattern used in Group 1). Group 2 and Group 3 can be separated with whitespaces (not necessarily, or the punctuation glued to a word would not be matched). Also, note the negative lookbehind (?<!') that will ensure that C'est is treated as a single entity.
Explanation
The pattern details:
((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*)) - Group 1 matching:
(?:[^\w\s'-]+(?=\s|\b) - 1+ characters other than [a-zA-Z0-9_], whitespace, ' and - immediately followed with a whitespace or a word boundary
| - or
\b(?<!')\w+(?:['-]\w+)*) - 1+ word characters not preceded with a ' (due to (?<!')) and preceded with a word boundary (\b) and followed with 0+ sequences of - or ' followed with 1+ word characters.
\s* - 0+ whitespaces
((?1)) - Group 2 (same pattern as for Group 1)
\s*((?1)) - see above

How to change a quantifier in a Regex based on a condition?

I would like to find words of length >= 1 which may contain a ' or a - within. Here is a test string:
a quake-prone area- (aujourd'hui-
In Python, I'm currently using this regex:
string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)
I would like to get this result:
>>> words
>>> [u'a', u'quake-prone', u'area', u"aujourd'hui"]
but I get this instead:
>>> words
>>> [u'quake-prone', u'area', u"aujourd'hui"]
Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier, it will find a but also area- instead of area.
Then how could create a conditional regex saying: if the word contains an apostrophe or an hyphen, use the + quantifier else use the * quantifier ?
I suggest you to change the last [-\']?[a-z]+ part as optional by putting it into a group and then adding a ? quantifier next to that group.
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]
Reason for why the a is not printed is because of your regex contains two [a-z]+ which asserts that there must be atleast two lowercase letters present in the match.
Note that the regex i mentioned won't match area- because (?:[-\'][a-z]+)? optional group asserts that there must be atleast one lowercase letter would present just after to the - symbol. If no, then stop matching until it reaches the hyphen. So that you got area at the output instead of area- because there isn't an lowercase letter exists next to the -. Here it stops matching until it finds an hyphen without following lowercase letter.

Exclude matched string python re.findall

I am using python's re.findall method to find occurrence of certain string value in Input string.
e.g. From search in 'ABCdef' string, I have two search requirements.
Find string starting from Single Capital letter.
After 1 find string that contains all capital letter.
e.g. input string and expected output will be:
'USA' -- output: ['USA']
'BObama' -- output: ['B', 'Obama']
'Institute20CSE' -- output: ['Institute', '20', 'CSE']
So My expectation from
>>> matched_value_list = re.findall ( '[A-Z][a-z]+|[A-Z]+' , 'ABCdef' )
is to return ['AB', 'Cdef'].
But which does Not seems to be happening. What I get is ['ABC'] as return value, which matches later part of regex with full string.
So Is there any way we can ignore found matches. So that once 'Cdef' is matched with '[A-Z][a-z]+'. second part of regex (i.e. '[A-Z]+') only matches with remaining string 'AB'?
First you need to match AB, which is followed by an Uppercase alphabet and then a lowercase alphabet. or is at the end of the string. For that you can use look-ahead.
Then you need to match an Uppercase alphabet C, followed by multiple lowercase alphabets def.
So, you can use this pattern:
>>> s = "ABCdef"
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", s)
['AB', 'Cdef']
>>> re.findall("([A-Z]+(?=[A-Z][a-z]|$)|[A-Z][a-z]+)", 'MumABXYZCdefXYZAbc')
['Mum', 'ABXYZ', 'Cdef', 'XYZ', 'Abc']
As pointed out in comment by #sotapme, you can also modify the above regex to: -
"([A-Z]+(?=[A-Z]|$)|[A-Z][a-z]+|\d+)"
Added \d+ since you also want to match digit as in one of your example. Also, he removed [a-z] part from the first part of look-ahead. That works because, + quantifier on the [A-Z] outside is greedy by default, so, it will automatically match maximum string, and will stop only before the last upper case alphabet.
You can use this regex
[A-Z][a-zA-Z]*?(?=[A-Z][a-z]|[^a-zA-Z]|$)

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I'm tried out of various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I'm tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alpha alphanumerics greedily followed by either one or no white-space character and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong but I would like to know why. (I expected this to return the entire sentence as the result)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation(python help('re')) says that the ,+,? Match x or x (greedy) repetitions of the preceding RE.
In such a case is simply space the preceding RE for '?' or is '\w+ ' the preceding RE? And what will be the RE for the '' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.
Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use re.group(0) to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of white space character in continuation? Because a sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row, but rather a sentence is the area of text that ends with a punctuation mark or rather something that is not in the above sequence including white space.
([a-zA-Z0-9\s])*
The above regex will match a sentence wherein it is a series or spaces in series zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with a alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| +)')
def main():
i = 0
for s in re_sentence.finditer(source):
print "%d:%s" % (i, s.group(0))
i += 1
if __name__ == '__main__':
main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.

Categories

Resources