Regex for Matching Apostrophe 's' words

I'm trying to create a regex to match a word that may or may not have an apostrophe before the 's' at the end. In the example below, I'd like to replace the apostrophe with a pattern that matches either an apostrophe 's' or just an 's'.
Philip K Dick's Electric Dreams
Philip K Dicks Electric Dreams
What I am trying so far is below, but I'm not getting it to match correctly. Any help here is great. Thanks!
Philip K Dick[\'[a-z]|[a-z]] Electric Dreams

Just make the apostrophe optional in the regex pattern.
Like this: [a-zA-Z]+\'?s
For example, using your test strings:
>>> import re
>>> s1 = "Philip K Dick's Electric Dreams"
>>> s2 = "Philip K Dicks Electric Dreams"
>>> re.findall("[a-zA-Z]+\'?s", s1)
["Dick's", 'Dreams']
>>> re.findall("[a-zA-Z]+\'?s", s2)
['Dicks', 'Dreams']
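If the goal is to match the whole title whether or not the apostrophe is present, the same optional-apostrophe idea can be placed inside the full pattern. A minimal sketch, assuming the rest of the title is fixed text (the name title_pattern and the third test title are only illustrative):

```python
import re

# The apostrophe before the final "s" is optional: matches "Dick's" and "Dicks".
title_pattern = re.compile(r"Philip K Dick'?s Electric Dreams")

titles = [
    "Philip K Dick's Electric Dreams",
    "Philip K Dicks Electric Dreams",
    "Philip K Dick Electric Dreams",  # no "s" at all, so it should not match
]

# fullmatch requires the entire string to match the pattern.
results = [bool(title_pattern.fullmatch(t)) for t in titles]
print(results)
```

Only the apostrophe is optional here; the trailing s is still required, which is what separates this from the character-class attempt in the question.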

You can use the regex (\w+)'s to represent any letters followed by 's. Then you can substitute back that word followed by just s.
>>> s = "Philip K Dick's Electric Dreams"
>>> re.sub(r"(\w+)'s", r'\1s', s)
'Philip K Dicks Electric Dreams'

Related

REGEX_String between strings in a list

From this list:
['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
I would like to reduce it to this list:
['BELMONT PARK', 'EAGLE FARM']
You can see from the first list that the desired words are between '\n' and '('.
My attempted solution is:
for i in x:
    result = re.search('\n(.*)(', i)
    print(result.group(1))
This returns the error 'unterminated subpattern'.
Thank you

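You’re getting an error because the ( is unescaped. Even with it escaped, the pattern will not work if the \n sequences are literal backslash-n text (as when the list is pasted into a regex tester), because . then matches them too and you’ll get the following matches:
\nBELMONT PARK (
\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (
You can try the following:
(?<=\\n)(?!.*\\n)(.*)(?= \()
(?<=\\n): Positive lookbehind to ensure \n is before the match
(?!.*\\n): Negative lookahead to ensure no further \n follows
(.*): Your match
(?= \(): Positive lookahead to ensure a space and ( come after the match
Here \\n matches a literal backslash followed by n. If the strings contain real newline characters, as the Python list in the question does, write \n instead.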
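A sketch of how the lookaround approach could look in Python, assuming the list holds real newline characters (so the pattern uses \n rather than \\n). The names lookaround and found are illustrative:

```python
import re

x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']

# (?<=\n)   : the match must start right after a newline
# (?!.*\n)  : no further newline may follow on the way to the end
# (.*)      : the text to keep
# (?= \()   : the match must be followed by a space and "("
lookaround = re.compile(r"(?<=\n)(?!.*\n)(.*)(?= \()")

found = []
for item in x:
    m = lookaround.search(item)
    if m:
        found.append(m.group(1))
print(found)
```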

You can get the matches without using any lookarounds, as you are already using a capture group.
\n(.*) \(
Explanation
\n Match a newline
(.*) Capture group 1, match any character except a newline, as much as possible
" \(" Match a space and a literal (
Example
import re
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
for i in x:
    m = re.search(pattern, i)
    if m:
        print(m.group(1))
Output
BELMONT PARK
EAGLE FARM
If you want to return a list:
x = ['AUSTRALIA\nBELMONT PARK (WA', '\nR3\n1/5/4/2\n2/3/1/5\nEAGLE FARM (QLD']
pattern = r"\n(.*) \("
res = [m.group(1) for i in x for m in [re.search(pattern, i)] if m]
print(res)
Output
['BELMONT PARK', 'EAGLE FARM']

Remove space delimited single characters

I have texts that look like this:
the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト jumps over the lazy dog 跳過懶狗 best wishes : John Doe
What's a good regex (for python) that can remove the single-characters so that the output looks like this:
the quick brown fox 狐狸 jumps over the lazy dog 跳過懶狗 best wishes John Doe
I've tried some combinations of \s{1}\S{1}\s{1}\S{1}, but they inevitably end up removing more letters than I need.

You can replace the following with empty string:
(?<!\S)\S(?!\S).?
Match a non-space that has no non-spaces on either side of it (i.e. surrounded by spaces), plus the character after that (if any).
The reason why I used negative lookarounds is because it neatly handles the start/end of string case. We match the extra character that follows the \S to remove the space as well.
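Applied with re.sub, the replacement might look like the sketch below. It uses the sample text from the question; the variable names text and cleaned are illustrative.

```python
import re

text = ('the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト '
        'jumps over the lazy dog 跳過懶狗 best wishes : John Doe')

# A single non-space character with no non-space neighbours,
# plus the one character after it (normally the trailing space).
cleaned = re.sub(r"(?<!\S)\S(?!\S).?", "", text)
print(cleaned)
```

Note that a single character at the very end of the string has no following character to consume, so the space before it would be left behind; a final .rstrip() would handle that case if it matters.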

A non-regex version might look like:
source_string = r"this is a string I created"
modified_string = ' '.join(x for x in source_string.split() if len(x) > 1)
print(modified_string)

You can also use re.findall to keep only runs of at least two word characters, which drops the single characters. Note that this also drops punctuation, and \w only matches non-ASCII text such as 狐狸 when the pattern is Unicode-aware (the default for str patterns in Python 3).
import re
s = 'the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト jumps over the lazy dog 跳過懶狗 best wishes : John Doe'
output = re.findall(r'\w{2,}', s)
output = ' '.join(output)
print(output)

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or fewer immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?

Try this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
(?:[A-Z][a-z]+ ?){1,3} - a capitalized word, optionally followed by a space, repeated 1 to 3 times
[A-Z] - capital letter (first character of the word)
[a-z]+ - lowercase letters (rest of the word)
_? - optional space separating the capitalized words (_ stands for a space); the ? lets the last word end without a trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as the opening one
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
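For reference, the pattern can be applied from Python roughly as follows. This is a sketch that assumes the dummy sentence from the question is stored in a variable named txt (an illustrative name); group 2 holds the text between the quotes.

```python
import re

txt = '''Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".'''

# Group 1: the opening quote; group 2: one to three capitalized words;
# \1: the same quote character again as the closing quote.
pattern = re.compile(r"""(["'])((?:[A-Z][a-z]+ ?){1,3})\1""")

names = [m.group(2) for m in pattern.finditer(txt)]
print(names)
```

finditer is used instead of findall so that only group 2 is collected; findall would return a tuple of both groups for every match.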

Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']

Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

regex for removing entity names

Given tweets like the following:
Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform
Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold
How do I write a regex that removes both "by Cormark" and "by Zacks Investment Research"?
I tried this:
"by ([A-Za-z ]+\w to)"
using python but it requires the word "to". I would like the regex to stop before capturing the word "to".
It would also be interesting if someone could show me how to write a regex that captures camel-case examples, like "Zacks Investment Research".

You can use a positive look-ahead in order to exclude the word to:
>>> s1 = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform"
>>>
>>> s2 = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>>
>>> import re
>>> re.sub(r'by[\w\s]+(?=to)','',s1)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>> re.sub(r'by[\w\s]+(?=to)','',s2)
'Brinker International Inc (EAT) Upgraded to Hold'
>>>
Note that the regex [\w\s]+ will match any combination of word characters and white spaces. If you just want to match the alphabetical characters and white space you can use [a-z\s] with re.I flag (Ignore case).
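One caveat worth illustrating: (?=to) also matches the letters "to" inside a longer word, and the greedy [\w\s]+ runs to the last such position. The sketch below is a variation, not the pattern above: it adds word boundaries and a lazy quantifier so the match stops at the first standalone "to". The function name remove_analyst and the third tweet are hypothetical, made up for illustration.

```python
import re

def remove_analyst(tweet):
    # \bby\s+     : the whole word "by" and the whitespace after it
    # [\w\s]+?    : as few word/space characters as possible ...
    # (?=\bto\b)  : ... up to, but not including, the first whole word "to"
    return re.sub(r"\bby\s+[\w\s]+?(?=\bto\b)", "", tweet)

s1 = "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform"
s2 = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
s3 = "Acme Corp (ACM) Downgraded by Cormark to Sector Monitor"  # hypothetical tweet

for s in (s1, s2, s3):
    print(remove_analyst(s))

# For comparison, the greedy pattern without word boundaries
# stops at the "to" inside "Monitor" for the third tweet.
greedy_s3 = re.sub(r"by[\w\s]+(?=to)", "", s3)
print(greedy_s3)
```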

To remove all capitalized words after by, you can use
by [A-Z][a-z]*(?: +[A-Z][a-z]*)*
Explanation:
by - literal sequence of 3 characters b, y and a space
[A-Z][a-z]* - a capitalized word (one uppercase followed by zero or more lowercase letters)
(?: +[A-Z][a-z]*)* - zero or more sequences of...
" +[A-Z][a-z]*" - 1 or more spaces followed by an uppercase letter followed by zero or more lowercase letters.
A regular space may be replaced with \s in the pattern to match any whitespace. Also, to match CaMeL words, you can replace all [a-z] with [a-zA-Z].
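A short sketch of this pattern in use with re.sub. The leading \s+ is an addition (an assumption about the desired output) so that the space before "by" is removed as well and no double space is left behind; the names capitalized_after_by and cleaned are illustrative.

```python
import re

# "by " followed by one or more capitalized words separated by spaces.
# The leading \s+ is not part of the pattern above; it removes the space before "by".
capitalized_after_by = re.compile(r"\s+by [A-Z][a-z]*(?: +[A-Z][a-z]*)*")

tweets = [
    "Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform",
    "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold",
]

cleaned = [capitalized_after_by.sub("", t) for t in tweets]
for c in cleaned:
    print(c)
```

The match ends at "to" without any lookahead because "to" starts with a lowercase letter and so cannot continue the run of capitalized words.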

You could also do it with the str.index method, then slice and concatenate:
>>> def remove_name(s):
...     b = s.index(' by ')
...     t = s.index(' to ')
...     s = s[:b] + s[t:]
...     return s
...
>>>
>>> s = 'Brick Brewing Co Limited (BRB) Downgraded by Cormark to Market Perform'
>>> remove_name(s)
'Brick Brewing Co Limited (BRB) Downgraded to Market Perform'
>>>
>>> s = "Brinker International Inc (EAT) Upgraded by Zacks Investment Research to Hold"
>>> remove_name(s)
'Brinker International Inc (EAT) Upgraded to Hold'

Python regex findall

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Or simply, I want to extract every piece of text inside the [P][/P] tags.
Here is my attempt:
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get: ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
or ['Barack Obama', 'Bill Gates'].

import re
regex = ur"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?' except harder to read.
The first bracketed group [[1P] tells re that any of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]]. That's not what you want at all. So,
Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
To return only the words inside the tags, place grouping parentheses around .+?.
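The three fixes above can be sketched for Python 3 as follows. The ur"" prefix from the Python 2 code is not valid syntax in Python 3, so plain raw strings are used here instead; the variable names names and tagged are illustrative.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# Names only: the capture group holds the text between the tags,
# with the padding spaces kept outside the group.
names = re.findall(r"\[P\] (.+?) \[/P\]", line)

# Whole tagged spans: with no capture group, findall returns each full match.
tagged = re.findall(r"\[P\].+?\[/P\]", line)

print(names)
print(tagged)
```

The lazy .+? matters in both patterns: a greedy .+ would run from the first [P] to the last [/P] and return a single match.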

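Try this, where subject is the string to search:
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    # text between the tags: match.group(1)
    print(match.group(1))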

Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall('\[P\]\s?(.+?)\s?\[\/P\]', line)
['Barack Obama', 'Bill Gates']

You can replace your pattern with
regex = ur"\[P\]([\w\s]+)\[\/P\]"

Use this pattern,
pattern = '\[P\].+?\[\/P\]'
