Remove Whitespaces before Capital Letters using re - python

This is probably simple, but I'm relatively new to regex. I would like to change the following strings:
" I love cats", " I love dogs"
to:
"I love cats", "I love dogs"
I just want to know how to set up a pattern that removes whitespace before some other pattern; in this instance, a capital letter.

You can use a lookahead assertion combined with re.sub():
import re
s = ' I love cats'
re.sub(r'''^ # match beginning of string
\s+ # match one or more instances of whitespace
(?=[A-Z]) # positive lookahead assertion of an uppercase character
''','',s,flags=re.VERBOSE)
And to show you that the whitespace is not removed before a lowercase letter:
s = ' this is a test'
re.sub(r'^\s+(?=[A-Z])','',s)
Result:
' this is a test'
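To apply the same substitution to several strings, as in the question, one option is to precompile the pattern and map it over a list. This is only a minimal sketch; the names LEADING_WS and strings are made up for illustration.

```python
import re

# Leading whitespace that is immediately followed by an uppercase letter.
LEADING_WS = re.compile(r'^\s+(?=[A-Z])')

strings = [' I love cats', ' I love dogs', ' this is a test']

# Strings that do not start with whitespace + capital letter are left untouched.
cleaned = [LEADING_WS.sub('', s) for s in strings]
print(cleaned)  # ['I love cats', 'I love dogs', ' this is a test']
```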

Related

Using regex to find all phrases that are completely capitalized

I want to use a regex to match all substrings that are completely capitalized, including the spaces between words.
Right now I am using the regexp \w*[A-Z]\s] on this string:
HERE IS Test WHAT ARE WE SAYING
which returns:
HERE
IS
WHAT
ARE
WE
SAYING
However, I would like it to match all substrings that are in all caps, so that it returns:
HERE IS
WHAT ARE WE SAYING
You can use word boundaries \b and [^\s] to keep matches from starting or ending with a space. Put together, it might look like this:
import re
string = "HERE IS Test WHAT ARE WE SAYING is that OKAY"
matches = re.compile(r"\b[^\s][A-Z\s]+[^\s]\b")
matches.findall(string)
>>> ['HERE IS', 'WHAT ARE WE SAYING', 'OKAY']
You could use findall:
import re
text = 'HERE IS Test WHAT ARE WE SAYING'
print(re.findall(r'[\sA-Z]+(?![a-z])', text))
Output
['HERE IS ', ' WHAT ARE WE SAYING']
The pattern [\sA-Z]+(?![a-z]) matches a run of whitespace and uppercase letters that is not immediately followed by a lowercase letter. The notation (?![a-z]) is known as a negative lookahead (see Regular Expression Syntax).
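Because the character class also admits whitespace, the matches keep their surrounding spaces. A possible follow-up, sketched here under the assumption that only the phrases themselves are wanted, is to strip each match and drop any match that was whitespace only:

```python
import re

text = 'HERE IS Test WHAT ARE WE SAYING'

# Same pattern as above: a run of whitespace/uppercase letters
# that is not immediately followed by a lowercase letter.
raw = re.findall(r'[\sA-Z]+(?![a-z])', text)

# Strip the surrounding whitespace and discard whitespace-only matches.
phrases = [m.strip() for m in raw if m.strip()]
print(phrases)  # ['HERE IS', 'WHAT ARE WE SAYING']
```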
One option is to use re.split with the pattern \s*(?:\w*[^A-Z\s]\w*\s*)+:
import re
text = "HERE IS Test WHAT ARE WE SAYING"
parts = re.split(r'\s*(?:\w*[^A-Z\s]\w*\s*)+', text)
print(parts)
['HERE IS', 'WHAT ARE WE SAYING']
The idea here is to split on any sequential cluster of words, each of which contains at least one character that is not an uppercase letter.
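One thing to watch with the split approach: when the text starts or ends with a mixed-case word, re.split returns empty strings at the edges. A small sketch of filtering them out (SPLITTER and the sample text are illustrative names and data, not from the answer):

```python
import re

# Pattern from the answer above: a run of words that each contain
# at least one character that is not an uppercase letter.
SPLITTER = re.compile(r'\s*(?:\w*[^A-Z\s]\w*\s*)+')

text = "Test HERE IS now"

# Without the filter the result would be ['', 'HERE IS', ''].
parts = [p for p in SPLITTER.split(text) if p]
print(parts)  # ['HERE IS']
```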
You can use [A-Z ]+ to match capital letters and spaces, and use negative lookahead (?! ) and negative lookbehind (?<! ) to forbid the first and last character from being a space.
Finally, surrounding the pattern with \b to match word boundaries will make it only match full words.
import re
text = "A ab ABC ABC abc Abc aBc abC C"
pattern = r'\b(?! )[A-Z ]+(?<! )\b'
re.findall(pattern, text)
>>> ['A', 'ABC ABC', 'C']
You can also use the following method:
>>> import re
>>> s = 'HERE IS Test WHAT ARE WE SAYING'
>>> print(re.findall(r'((?!\s+)[A-Z\s]+(?![a-z]+))', s))
OUTPUT:
['HERE IS ', 'WHAT ARE WE SAYING']
Using findall() without matching leading and trailing spaces:
re.findall(r"\b[A-Z]+(?:\s+[A-Z]+)*\b",s)
Out: ['HERE IS', 'WHAT ARE WE SAYING']

Eliminating words that have two or more periods together in Python using Regex?

For example, if I have a string:
"I really..like something like....that"
I want to get only:
"I something"
Any suggestion?
If you want to do it with a regex, you can use the regex below to remove them:
r"[^\.\s]+\.{2,}[^\.\s]+"g
Regex explanation:
[^\.\s]+ one or more characters other than '.' and whitespace
\.{2,} two or more '.'
[^\.\s]+ one or more characters other than '.' and whitespace
or this regex:
r"\s+[^\.\s]+\.{2,}[^\.\s]+"g
^^^ also consumes the whitespace before each such combination
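The trailing g in the patterns above is not Python syntax (it appears to be the demo site's notation for a global match); in Python, re.sub already replaces every occurrence. A minimal sketch of the second pattern applied with re.sub, using the question's sample string:

```python
import re

string = "I really..like something like....that"

# Remove each token that contains two or more consecutive dots,
# together with the whitespace in front of it.
cleaned = re.sub(r"\s+[^.\s]+\.{2,}[^.\s]+", "", string)
print(cleaned)  # I something
```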
If you want to use a regex explicitly you could use the following.
import re
string = "I really..like something like....that"
with_dots = re.findall(r'\w+[.]+\w+', string)
split = string.split()
without_dots = [word for word in split if word not in with_dots]
The solution provided by rawing also works in this case:
' '.join(word for word in string.split() if '..' not in word)
You may very well use boundaries in combination with lookarounds:
\b(?<!\.)(\w+)\b(?!\.)
See a demo on regex101.com.
Broken apart, this says:
\b # a word boundary
(?<!\.) # followed by a negative lookbehind making sure there's no '.' behind
\w+ # 1+ word characters
\b # another word boundary
(?!\.) # a negative lookahead making sure there's no '.' ahead
As a whole Python snippet:
import re
string = "I really..like something like....that"
rx = re.compile(r'\b(?<!\.)(\w+)\b(?!\.)')
print(rx.findall(string))
# ['I', 'something']

Replace single quotes with double with exclusion of some elements

I want to replace all single quotes in the string with double with the exception of occurrences such as "n't", "'ll", "'m" etc.
input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""
Code 1:(#https://stackoverflow.com/users/918959/antti-haapala)
import re
def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)
There are 3 cases: the ' is neither preceded nor followed by an alphanumeric character; it is not preceded but is followed by one; or it is preceded but not followed by one.
Issue: That doesn't work on words that end in an apostrophe, i.e.
most possessive plurals, and it also doesn't work on informal
abbreviations that start with an apostrophe.
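The issue can be reproduced with a short sketch. The second input string is made up for illustration; the function body is the one from Code 1.

```python
import re

def convert_regex(text):
    # Replace every ' that is not surrounded by word characters on both sides.
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

quoted = convert_regex("the stackoverflow don't said, 'hey what'")
print(quoted)      # the stackoverflow don't said, "hey what"

possessive = convert_regex("the persons' books")
print(possessive)  # the persons" books  (the possessive apostrophe is converted too)
```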
Code 2:(#https://stackoverflow.com/users/953482/kevin)
def convert_text_func(s):
    c = "_"  # placeholder character; must NOT appear in the string
    assert c not in s
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    # hide the apostrophes of the protected words behind the placeholder
    for k, v in protected.items():
        s = s.replace(k, v)
    s = s.replace("'", '"')
    # restore the protected words
    for k, v in protected.items():
        s = s.replace(v, k)
    return s
Issue: the set of words to protect is too large to enumerate; for example, how would one list every possessive such as persons'?
Please help.
Edit 1:
I am using @anubhava's brilliant answer, but I am facing one issue: the approach sometimes fails on text that contains words translated from another language.
Code=
text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)
Problem:
In the text below, 'melas' in 'Kumbh melas' is a Hindi-to-English translation, not a plural possessive noun.
Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
I am hoping to add a condition that somehow fixes this. Manual intervention is the last resort.
Edit 2:
Naive and long approach to fix:
import string
import enchant  # pyenchant
def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # tokenize_words is my own helper (not shown)
    punctuations = set(string.punctuation)
    for i, word in enumerate(words):
        # only non-dictionary words that are directly followed by a lone apostrophe
        if (i + 1 < len(words) and word not in punctuations
                and not d.check(word) and words[i + 1] == "'"):
            text = text.replace(word + "'", word + '"')
    return text
Are there any corner cases I am missing or are there any better approaches?
First attempt
You can also use this regex:
(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))
This regex matches a whole quoted word or sentence together with both of its quotation marks, and captures the quoted content in group 1, so you can replace the match with "\1".
(?<!\w) - negative lookbehind for a word character: it excludes the apostrophe in words like "you'll", but still lets the regex match quotations that follow characters such as \n, :, ;, . or -. Assuming that there is always whitespace before a quotation would be risky,
' - a single quotation mark,
((?:.|\n)+?'?) - capturing group 1: one or more of any character or newline (to match multiline sentences) with a lazy quantifier (to avoid matching from the first to the last single quotation mark), followed by an optional single quotation mark in case there are two in a row,
'(?!\w) - a single quotation mark that is not followed by a word character, to exclude text like "i'm" or "you're", where the mark sits between letters.
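A minimal Python sketch of the first-attempt regex, assuming the standard re module and the question's sample sentence (the name QUOTED is made up for illustration; the redundant outer non-capturing group is dropped):

```python
import re

# First-attempt pattern: group 1 holds the text between the quotation marks.
QUOTED = re.compile(r"(?<!\w)'((?:.|\n)+?'?)'(?!\w)")

text = "the stackoverflow don't said, 'hey what'"

# Re-wrap the captured content in double quotes.
result = QUOTED.sub(r'"\1"', text)
print(result)  # the stackoverflow don't said, "hey what"
```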
The s' case
However, it still has a problem with sentences in which an apostrophe follows a word ending in s, like: 'the classes' hours'. I think it is impossible to tell with a regex alone whether an s followed by ' ends the quotation or is a possessive s with an apostrophe. But I figured out a limited workaround for this problem, with the regex:
(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))
It adds an alternative for the s' cases, (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)), where:
(?<!s)'(?!\w) - if there is no s before the ', match as in the regex above (first attempt),
(?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before the ', end the match on this ' only if the following text contains no other ' followed by a non-word character before the end of the text or before another ' (but only a ' preceded by a letter other than s, or the opening of the next quotation). The \w'\w is there so that a ' between letters, as in i'm, is included in such a match.
This regex should only mismatch when there are several s' cases in a row. Still, it is far from a perfect solution.
Flaws of \w
Also, with \w there is always a chance that a ' occurs after a symbol, or after a character that is a letter but not in [a-zA-Z_0-9] (such as a character from some local language), and is then treated as the beginning of a quotation. This could be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}) where the regex flavor supports them, or with something like (?<=^|[,.?!)\s]), i.e. a positive lookaround for the characters which can occur in a sentence before a quotation. However, such a list could be quite long.
You can use:
import re
text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print(re.sub(r"(?<!s)'(?!(?:t|ll|e?m)\b)", '"', text))
Output:
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
Try this: you can use the regex ((?<=\s)'([^']+)'(?=\s)) and replace with "\2"
import re
p = re.compile(r"((?<=\s)'([^']+)'(?=\s))")
test_str = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
subst = r'"\2"'  # raw string, so \2 is a group reference rather than an octal escape
result = p.sub(subst, test_str)
print(result)
Output
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
Here is a non-regex way of doing it
text = "the stackoverflow don't said, 'hey what'"
out = []
for i, ch in enumerate(text):
    if ch == "'":
        # keep the apostrophe when it belongs to n't, 'll or 'm
        if text[i-1:i+2] == "n't" or text[i:i+3] == "'ll" or text[i:i+2] == "'m":
            out.append(ch)
        else:
            out.append('"')
    else:
        out.append(ch)
print(''.join(out))
gives as an output
the stackoverflow don't said, "hey what"
Of course, you can improve this by keeping the exclusions in a list instead of checking each one by hand.
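One way the exclusion list could look, as a rough sketch: the helper names is_contraction and convert_quotes and the default tuple of contractions are assumptions for illustration, not part of the answer. Each contraction is located relative to its own apostrophe, so entries such as "n't" and "'ll" can share one loop.

```python
def is_contraction(text, i, contractions):
    """Return True if the apostrophe at text[i] is part of a listed contraction."""
    for c in contractions:
        start = i - c.index("'")  # where the contraction would have to begin
        if start >= 0 and text[start:start + len(c)] == c:
            return True
    return False

def convert_quotes(text, contractions=("n't", "'ll", "'m")):
    out = []
    for i, ch in enumerate(text):
        # Only apostrophes outside the listed contractions become double quotes.
        if ch == "'" and not is_contraction(text, i, contractions):
            out.append('"')
        else:
            out.append(ch)
    return ''.join(out)

converted = convert_quotes("the stackoverflow don't said, 'hey what' I'll go")
print(converted)  # the stackoverflow don't said, "hey what" I'll go
```

Like the loop above, this does not check word boundaries, so a quotation that opens with a word starting in m (for example 'my) would be mistaken for 'm.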
Here is another possible way of doing it:
import re
text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print(re.sub(r"((?<!s)'(?!\w+)|(\s+'))", '"', text))
I have tried to avoid the need for special cases. It gives (note that the space before the opening quote is consumed as well):
I'm one of the persons' stackoverflow don't th'em said,"hey what" I'll handle it.

Python Regex Help Needed (Basic)

I need a python regex which can help me eliminate illegal characters inside a word.
The conditions are as such:
The first character must be a-z only
All characters in the word should only be a-z (lower case) plus apostrophe ' and hyphen -
The last character must be a-z or apostrophe ' only
You can assume that the word is always lower-case
Test data:
s = "there is' -potato 'all' around- 'the 'farm-"
Expected output:
>>>print(s)
there is' potato all' around the farm
My code is currently as such but it doesn't work correctly:
newLine = re.findall(r'[a-z][-\'a-z]*[\'a-z]?', s)
Any assistance would be greatly appreciated! Thanks!
Just match only the chars you don't want and remove them with re.sub:
>>> import re
>>> s = """potato
-potato
'human'
potatoes-"""
>>> m = re.sub(r"(?m)^['-]|-$", r'', s)
>>> print(m)
potato
potato
human'
potatoes
OR
>>> m = re.sub(r"(?m)^(['-])?([a-z'-]*?)-?$", r'\2', s)
>>> print(m)
potato
potato
human'
potatoes
Try this:
>>> b = re.findall(r"[a-z][-'a-z]*['a-z]", s)
>>> for i in b: print(i)
...
potato
potato
human'
potatoes
You can try:
[a-z][a-z'\-]*[a-z]|[a-z]
Well, assuming every word is separated by a space, you could find all the valid words with something like this regex:
(?<= |^)[a-z](?:(?:[\-\'a-z]+)?[\'a-z])?(?= |$)
But if you want to eliminate illegal characters, I guess you're better off finding the illegal characters and removing them.
Again, we assume that you have a string which contains only words separated by spaces and nothing else.
So first of all we can sub all invalid characters out of the string: [^a-z-' ]
After doing this, the only things that could still be invalid are a ' or - at the beginning of a word or a - at the end of a word.
So we sub those out with this regex: (?<= |^)['-]+|-+(?= |$)
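The two substitution steps can be sketched in Python as follows. One caveat, as far as I know: Python's re module rejects the lookbehind (?<= |^) because its alternatives have different widths, so this sketch swaps in (?<!\S) and (?!\S), which express "start or end of a whitespace-delimited word". The function name clean_words is made up for illustration.

```python
import re

def clean_words(line):
    # Step 1: drop every character that is not a-z, apostrophe, hyphen or space.
    line = re.sub(r"[^a-z' -]", "", line)
    # Step 2: strip apostrophes/hyphens at the start of a word and hyphens at the end.
    # (?<!\S) and (?!\S) stand in for (?<= |^) and (?= |$).
    return re.sub(r"(?<!\S)['-]+|-+(?!\S)", "", line)

s = "there is' -potato 'all' around- 'the 'farm-"
cleaned = clean_words(s)
print(cleaned)  # there is' potato all' around the farm
```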

python regex: use first blank as sep but maintain rest of blank sequence

I've been fighting with this regex for too long now.
The split should use a blank as the separator, but keep the remaining blanks of a blank sequence with the next token:
'123 45  678  123.0'
=>
'123', '45', ' 678', ' 123.0'
My numbers can be floats as well, and the number of groups is unknown.
What about using a lookbehind assertion?:
>>> import re
>>> regex = re.compile(r'(?<=[^\s])\s')
>>> regex.split('this  is a  string')
['this', ' is', 'a', ' string']
regex breakdown:
(?<=...) #lookbehind. Only match if the `...` matches before hand
[^\s] #Anything that isn't whitespace
\s #single whitespace character
In English, this translates to "match a single whitespace character if it is preceded by a non-whitespace character."
Or you can use a negative lookbehind assertion:
regex = re.compile(r'(?<!\s)\s')
which might be slightly nicer (as suggested in the comments), and should be relatively easy to figure out how it works since it is very similar to the above.
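Both variants can be checked against the input from the question. The sketch below (the variable names positive and negative are illustrative) also shows the one place where they behave differently, namely a string that starts with whitespace:

```python
import re

positive = re.compile(r'(?<=\S)\s')  # whitespace preceded by a non-whitespace character
negative = re.compile(r'(?<!\s)\s')  # whitespace not preceded by whitespace

s = '123 45  678  123.0'
print(positive.split(s))  # ['123', '45', ' 678', ' 123.0']
print(negative.split(s))  # ['123', '45', ' 678', ' 123.0']

# The two differ only when the string starts with whitespace:
print(positive.split(' 1 2'))  # [' 1', '2']
print(negative.split(' 1 2'))  # ['', '1', '2']
```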
