Python Regex Help Needed (Basic) - python

I need a python regex which can help me eliminate illegal characters inside a word.
The conditions are as such:
The first character must be a-z only
All characters in the word should only be a-z (lower case) plus apostrophe ' and hyphen -
The last character must be a-z or apostrophe ' only
You can assume that the word is always lower-case
Test data:
s = "there is' -potato 'all' around- 'the 'farm-"
Expected output:
>>>print(s)
there is' potato all' around the farm
My code is currently as such but it doesn't work correctly:
newLine = re.findall(r'[a-z][-\'a-z]*[\'a-z]?', s)
Any assistance would be greatly appreciated! Thanks!

Just match only the chars you don't want and remove them with re.sub
>>> import re
>>> s = """potato
-potato
'human'
potatoes-"""
>>> m = re.sub(r"(?m)^['-]|-$", r'', s)
>>> print(m)
potato
potato
human'
potatoes
OR
>>> m = re.sub(r"(?m)^(['-])?([a-z'-]*?)-?$", r'\2', s)
>>> print(m)
potato
potato
human'
potatoes

Try this:
>>> b=re.findall(r'[a-z][-\'a-z]*[\'a-z]',a)
>>> for i in b: print(i)
...
potato
potato
human'
potatoes

You can try:
[a-z][a-z'\-]*[a-z]|[a-z]

Well assuming every word is separated by a space you could find all the valid words with something like this regex:
(?<= |^)[a-z](?:(?:[\-\'a-z]+)?[\'a-z])?(?= |$)
But if you want to eliminate illegal characters, I guess you're better off finding the illegal characters and removing them.
Now we assume again that you have a string which should contain only words separated by spaces and nothing else.
So first of all we can sub all invalid characters out of the string: [^a-z-' ]
After doing this the only thing that could still be invalid would be a ' or - in the beginning of the word or a - in the end of the word.
So we sub those out with this regex: (?<= |^)['-]+|-+(?= |$)
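The two substitutions described above can be sketched in Python roughly as follows. One caveat: Python's built-in re module rejects the variable-width lookbehind (?<= |^), so this sketch swaps in the assertions (?<!\S) and (?!\S) instead, assuming words are separated by whitespace. The function name clean_words is made up for illustration.

```python
import re

def clean_words(line):
    # Step 1: drop every character that is not a-z, apostrophe, hyphen or space.
    line = re.sub(r"[^a-z' -]", "", line)
    # Step 2: drop apostrophes/hyphens at the start of a word and hyphens at
    # the end of a word. (?<!\S) and (?!\S) stand in for (?<= |^) and (?= |$).
    return re.sub(r"(?<!\S)['-]+|-+(?!\S)", "", line)

s = "there is' -potato 'all' around- 'the 'farm-"
print(clean_words(s))  # there is' potato all' around the farm
```

A word that consists only of apostrophes or hyphens is removed entirely, which leaves two adjacent spaces behind.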

Related

Find consecutive capitalized words in a string, including apostrophes

I am using regex to find all instances of consecutive words that are both capitalized, and where some of the consecutive words contain an apostrophe, ie ("The mother-daughter bakery, Molly’s Munchies, was founded in 2009"). And I have written a few lines of code to do this:
import re
string = "The mother-daughter bakery, Molly’s Munchies, was founded in 2009"
test = re.findall(r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)", string)
print(test)
The issue is I am unable to print the result ('Molly's Munchies')
Instead my output is:
('[]')
Desired output:
("Molly's Munchies")
Any help appreciated, thank you!
You may use this regex in python:
r"\b[A-Z][a-z'’]*(?:\s+[A-Z][a-z'’]*)+"
RegEx Details:
\b: Word boundary
[A-Z]: Match a capital letter
[a-z'’]*: Match 0 or more characters containing lowercase letter or ' or ’
(?:\s+[A-Z][a-z'’]*)+ Match 1 or more such capital letter words
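As a quick check, the pattern above can be run against the sentence from the question. This is only a small sketch; the constant name CAPITALISED_RUN is made up for illustration.

```python
import re

# Pattern from the answer: runs of two or more capitalised words, where each
# word may contain a straight (') or curly (’) apostrophe.
CAPITALISED_RUN = re.compile(r"\b[A-Z][a-z'’]*(?:\s+[A-Z][a-z'’]*)+")

text = "The mother-daughter bakery, Molly’s Munchies, was founded in 2009"
print(CAPITALISED_RUN.findall(text))  # ['Molly’s Munchies']
```

"The" is not reported because the word after it starts with a lowercase letter, so it does not form a run of two capitalised words.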
You would need to add it in both places you define a "word". You only added it in one place.
string = "The Cow goes moo, and the Dog's Name is orange"
# The apostrophe is added to both [a-z'] character classes:
print(re.findall(r"([A-Z][a-z']+(?=\s[A-Z])(?:\s[A-Z][a-z']+)+)", string))
['The Cow', "Dog's Name"]

Remove Whitespaces before Capital Letters using re

It's quite simple but I'm relatively new to using Regex. I would like to change the following strings:
" I love cats", " I love dogs"
into:
"I love cats", "I love dogs"
I just want to know the setup for removing spaces before any sort of pattern. In this instance, a Capital Letter.
You can use a lookahead assertion combined with re.sub():
import re
s = ' I love cats'
re.sub(r'''^ # match beginning of string
\s+ # match one or more instances of whitespace
(?=[A-Z]) # positive lookahead assertion of an uppercase character
''','',s,flags=re.VERBOSE)
And to show you that the whitespace is not removed before a lowercase letter:
s = ' this is a test'
re.sub(r'^\s+(?=[A-Z])','',s)
Result:
' this is a test'
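The same substitution can be wrapped in a small helper and applied to several strings. This is a minimal sketch; the names LEADING_WS_BEFORE_CAPITAL and strip_before_capital are made up for illustration.

```python
import re

# Leading whitespace, but only when an uppercase letter follows it.
LEADING_WS_BEFORE_CAPITAL = re.compile(r"^\s+(?=[A-Z])")

def strip_before_capital(text):
    # The lookahead is not consumed, so the capital letter itself is kept.
    return LEADING_WS_BEFORE_CAPITAL.sub("", text)

for sample in [" I love cats", " I love dogs", " this is a test"]:
    print(repr(strip_before_capital(sample)))
# 'I love cats'
# 'I love dogs'
# ' this is a test'
```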

Using regex to find all phrases that are completely capitalized

I want to use regex to match all substrings that are completely capitalized, including the spaces.
Right now I am using regexp: \w*[A-Z]\s]
HERE IS Test WHAT ARE WE SAYING
Which returns:
HERE
IS
WHAT
ARE
WE
SAYING
However, I would like it to match with all substrings that are allcaps, so that it returns:
HERE IS
WHAT ARE WE SAYING
You can use word boundaries \b and [^\s] to prevent starting and ending spaces. Put together it might look a little like:
import re
string = "HERE IS Test WHAT ARE WE SAYING is that OKAY"
matches = re.compile(r"\b[^\s][A-Z\s]+[^\s]\b")
matches.findall(string)
>>> ['HERE IS', 'WHAT ARE WE SAYING', 'OKAY']
You could use findall:
import re
text = 'HERE IS Test WHAT ARE WE SAYING'
print(re.findall(r'[\sA-Z]+(?![a-z])', text))
Output
['HERE IS ', ' WHAT ARE WE SAYING']
The pattern [\sA-Z]+(?![a-z]) matches a run of whitespace and uppercase letters that is not immediately followed by a lowercase letter, which is why the matches above keep their leading or trailing spaces. The notation (?![a-z]) is known as a negative lookahead (see Regular Expression Syntax).
One option is to use re.split with the pattern \s*(?:\w*[^A-Z\s]\w*\s*)+:
import re
text = "HERE IS Test WHAT ARE WE SAYING"
parts = re.split(r'\s*(?:\w*[^A-Z\s]\w*\s*)+', text)
print(parts)
['HERE IS', 'WHAT ARE WE SAYING']
The idea here is to split on any run of consecutive words that each contain at least one character other than an uppercase letter.
You can use [A-Z ]+ to match capital letters and spaces, and use negative lookahead (?! ) and negative lookbehind (?<! ) to forbid the first and last character from being a space.
Finally, surrounding the pattern with \b to match word boundaries will make it only match full words.
import re
text = "A ab ABC ABC abc Abc aBc abC C"
pattern = r'\b(?! )[A-Z ]+(?<! )\b'
re.findall(pattern, text)
>>> ['A', 'ABC ABC', 'C']
You can also use the following method:
>>> import re
>>> s = 'HERE IS Test WHAT ARE WE SAYING'
>>> print(re.findall(r'((?!\s+)[A-Z\s]+(?![a-z]+))', s))
OUTPUT:
['HERE IS ', 'WHAT ARE WE SAYING']
Using findall() without matching leading and trailing spaces:
re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b",s)
Out: ['HERE IS', 'WHAT ARE WE SAYING']
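For comparison, here is a small sketch that wraps the last pattern in a helper and runs it on two of the sample strings used in this thread. The names ALL_CAPS_PHRASE and find_all_caps_phrases are made up for illustration.

```python
import re

# One or more all-uppercase words separated by whitespace; the \b anchors
# keep partially capitalised words such as "Test" or "aBc" out of the result.
ALL_CAPS_PHRASE = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*\b")

def find_all_caps_phrases(text):
    return ALL_CAPS_PHRASE.findall(text)

print(find_all_caps_phrases("HERE IS Test WHAT ARE WE SAYING"))
# ['HERE IS', 'WHAT ARE WE SAYING']
print(find_all_caps_phrases("A ab ABC ABC abc Abc aBc abC C"))
# ['A', 'ABC ABC', 'C']
```

Because the whitespace is matched only between words, no leading or trailing spaces end up in the results.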

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that a "word" in your requirements means one or more non-whitespace characters delimited by whitespace or by the start or end of the string, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
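A few sample inputs show how this substitution behaves, including the cases of exactly two repeats and of irregular spacing. This is only an illustrative sketch; the name REPEATED_WORD and the sample strings are made up.

```python
import re

# Group 2 is the word, group 1 is "word<whitespace>word"; the match also
# swallows every further repetition, and the replacement keeps only group 1.
REPEATED_WORD = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)")

samples = [
    "bye! bye! bye!",      # three repeats are reduced to two
    "bye! bye!",           # two repeats are left alone
    "no no  no   no way",  # irregular spacing between repeats
]
for sample in samples:
    print(repr(REPEATED_WORD.sub(r"\1", sample)))
# 'bye! bye!'
# 'bye! bye!'
# 'no no way'
```

The whitespace between the two words that are kept is preserved exactly as it appeared in the input.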
You could try the below regex also,
(?<= |^)(\S+)(?: \1){2,}(?= |$)
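Sample code (this uses the third-party regex module; the built-in re module rejects the variable-width lookbehind (?<= |^)):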
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
Try the following:
import re
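s = "bye! bye! bye!"  # your string
s = re.sub(r'(\S+) (?:\1 ?){2,}', r'\1 \1', s)
# Caveat: the optional trailing space is part of the match, so "hi hi hi some" becomes "hi hisome".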
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)

How to explode sentences with "。" but ignore the "。" in the double quotation marks

I am writing a program that extracts the abstract of a Chinese article. First I have to explode the text into sentences on symbols like “。!?”.
In Chinese articles, quoted speech is enclosed in double quotation marks. The quoted text may contain "。" but should not be exploded. For example, the following sentence:
他说:“今天天气很好。我很开心。”
It will be exploded into three sentences:
他说:“今天天气很好
我很开心
”
The result is wrong, but how can I solve it?
I have tried using regular expressions, but I am not good at them, so I could not figure it out.
PS: I write this program with python3
Instead of splitting, I’m matching all sentences using re.findall:
>>> s = '今天天气很好。今天天气很好。今天天气很好。他说:“今天天气很好。我很开心。”'
>>> re.findall('[^。“]+(?:。|“.*?”)', s)
['今天天气很好。', '今天天气很好。', '今天天气很好。', '他说:“今天天气很好。我很开心。”']
If you want to accept those other characters as separators too, try this:
>>> re.findall('[^。?!;~“]+(?:[。?!;~]|“.*?”)', s)
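The extended pattern can be wrapped in a helper for reuse. This is a minimal sketch; the names SENTENCE and find_sentences are made up for illustration, and the separator characters are copied from the pattern above as written.

```python
import re

# A sentence is a run of characters that are neither a separator nor an
# opening quote, followed by either a separator or a complete “…” quotation.
SENTENCE = re.compile(r"[^。?!;~“]+(?:[。?!;~]|“.*?”)")

def find_sentences(text):
    return SENTENCE.findall(text)

s = '今天天气很好。他说:“今天天气很好。我很开心。”'
print(find_sentences(s))
# ['今天天气很好。', '他说:“今天天气很好。我很开心。”']
```

Two limitations of this approach: trailing text that has no closing separator is not returned, and because . does not match a newline by default, a quotation that spans several lines is not matched.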
Use a regex:
import re
st=u'''\
今天天气很好。今天天气很好。bad? good! 今天天气很好。他说:“今天天气很好。我很开心。”
Sentence one. Sentence two! “Sentence three. Sentence four.” Sentence five?'''
pat=re.compile(r'(?:[^“。?!;~.]*?[?!。.;~])|(?:[^“。?!;~.]*?“[^”]*?”)')
print(pat.findall(st))
Prints:
['今天天气很好。', '今天天气很好。', 'bad?', ' good!', ' 今天天气很好。',
'他说:“今天天气很好。我很开心。”', '\nSentence one.', ' Sentence two!',
' “Sentence three. Sentence four.”', ' Sentence five?']
And if you want the effect of a split (ie, won't include the delimiter), just move the capturing parenthesis and then print the match group:
pat=re.compile(r'([^“。?!;~.]*?)[?!。.;~]|([^“。?!;~.]*?“[^”]*?”)')
# note the end paren: ^
print([t[0] if t[0] else t[1] for t in pat.findall(st)])
Prints:
['今天天气很好', '今天天气很好', 'bad', ' good', ' 今天天气很好',
'他说:“今天天气很好。我很开心。”', '\nSentence one', ' Sentence two',
' “Sentence three. Sentence four.”', ' Sentence five']
Or, use re.split with the same regex and then filter for True values:
print(list(filter(None, pat.split(st))))
First of all, I'll assume the double quotes can't be nested. Then it's quite easy to do this without some complicated regular expression. You just split on ", and then you split the even parts on your punctuation.
>>> sentence = 'a: "b. c" and d. But e said: "f? g."'
>>> sentence.split('"')
['a: ', 'b. c', ' and d. But e said: ', 'f? g.', '']
You can see how the even parts are the ones not between quotes. We'll use index % 2 == 1 to select the odd parts.
result = []
part = []
for i, p in enumerate(sentence.split('"')):
    if i % 2 == 1:
        part.append(p)
    else:
        parts = p.split('.')
        if len(parts) == 1:
            part.append(p)
        else:
            first, *rest, last = parts
            part.append(first)
            result.append('"'.join(part))
            result.extend(rest)
            part = [last]
result.append('"'.join(part))
I think you need to do this in two steps: first, find the dots inside the double quotes, and "protect" them (for example, replace them with a string like $%$%$%$ that is unlikely to appear in a Chinese text.). Next, explode the strings as before. Finally, replace the $%$%$%$ with a dot again.
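The three steps above could be sketched in Python as follows. The sketch assumes that quotations use the curly quotes “ and ”, that they are not nested, and that the placeholder string never occurs in the text. The function name split_protecting_quotes is made up for illustration.

```python
import re

PLACEHOLDER = "$%$%$%$"  # marker from the answer; assumed absent from the text

def split_protecting_quotes(text):
    # Step 1: inside each “…” span, swap every 。 for the placeholder.
    protected = re.sub(
        r"“[^”]*”",
        lambda m: m.group(0).replace("。", PLACEHOLDER),
        text,
    )
    # Step 2: split on the remaining (unquoted) 。 and drop empty pieces.
    pieces = [p for p in protected.split("。") if p]
    # Step 3: put the protected 。 back.
    return [p.replace(PLACEHOLDER, "。") for p in pieces]

print(split_protecting_quotes("今天天气很好。他说:“今天天气很好。我很开心。”"))
# ['今天天气很好', '他说:“今天天气很好。我很开心。”']
```

Like a plain split, this drops the 。 delimiters that sit outside quotations.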
Maybe this will work (PHP):
$str = '他说:“今天天气很好。我很开心。”';
print_r( preg_split('/(?=(([^"]*"){2})*[^"]*$)。/u', $str, -1, PREG_SPLIT_NO_EMPTY) );
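This is meant to match 。 only when it is outside double quotes. Note, however, that the lookahead counts ASCII " characters, while the sample string uses the curly quotes “ and ”, so the 。 inside the quotation is still split, as the output shows.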
OUTPUT:
Array
(
    [0] => 他说:“今天天气很好
    [1] => 我很开心
    [2] => ”
)
