Exclude matching the period character in a [\W\d]+ regex - python

I'd like to remove everything from a string except alphabetic characters and periods.
I wrote the function below in Python. How would I extend the regex so that periods are NOT stripped from the string? This needs to work for Unicode strings.
def normalize(self, text):
    text = re.sub(ur"(?u)[\W\d]+", ' ', text)
    print text
    return text

Change the semantics from 'strip everything in this group' to 'strip everything that's not in this group' and use:
text = re.sub(ur"(?u)[^a-zA-Z\.]+", ' ', text)
Update
I don't think the solution above will work with all Unicode alphabets, since a-zA-Z only covers ASCII letters.
The answers here offer alternative modules to the built-in re that support Unicode letter groups.
Another option is to combine the two approaches (note the u prefix, so that the input is a unicode string rather than a byte string):
>>> text = u'1234abcd.à!##$'
>>> re.sub(ur'(?u)([^\w\.]|\d)+', ' ', text)
u' abcd.\xe0 '
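In Python 3, str patterns are Unicode-aware by default, so the combined approach can be sketched without the ur prefix or the (?u) flag. This is only a minimal sketch: the name keep_letters_and_periods is made up for illustration, and also stripping the underscore (which \w would otherwise keep) is an assumption about the desired behaviour.

```python
import re

# Hypothetical helper: keep Unicode letters and periods, replace everything else.
# [^\w.] matches anything that is neither a word character nor a period;
# [\d_] additionally removes digits and underscores, which \w would keep.
_NON_LETTERS = re.compile(r"(?:[^\w.]|[\d_])+")

def keep_letters_and_periods(text):
    # Each run of unwanted characters collapses into a single space.
    return _NON_LETTERS.sub(" ", text)

print(repr(keep_letters_and_periods("1234abcd.à!##$")))  # ' abcd.à '
```

Compiling the pattern once keeps the helper cheap when it is called on many strings.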

Related

Splitting words before and after '/' using Regex

Have the following string:
text = 'interesting/practical/useful'
I need to split it like interesting / practical / useful using Regex.
I'm trying this
newText = re.sub(r'[a-zA-Z](/)[a-zA-Z]', ' \g<0> ', text)
but it returns
interestin g/p ractica l/u seful
P.S. I have other texts in the corpus that also have forward slashes that shouldn't be altered by this operation, which is why I'm looking for regex patterns specifically with characters before and after the '/'.
Use lookarounds for the letters before and after the /, so they're not included in the match.
newText = re.sub(r'(?<=[a-zA-Z])/(?=[a-zA-Z])', r' \g<0> ', text)
I usually do it this way:
newText = re.sub(r'([a-zA-Z])/([a-zA-Z])', r'\1 / \2', text)
\1 and \2 are references to the first and second groups captured by the parentheses (). Note that the replacement must be a raw string (or be written as '\\1 / \\2'); otherwise Python interprets \1 and \2 as control characters.
In plain English it means: look for a letter, a slash and a letter, and replace them with the first captured letter, a space, a slash, a space and the second captured letter.
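The two answers can be compared side by side with a small Python 3 sketch. The function names are made up for illustration; the patterns are the ones quoted above.

```python
import re

LETTER_SLASH_LOOKAROUND = r"(?<=[a-zA-Z])/(?=[a-zA-Z])"
LETTER_SLASH_GROUPS = r"([a-zA-Z])/([a-zA-Z])"

def space_slashes_lookaround(text):
    # Only the slash is matched; the letters are asserted but not consumed.
    return re.sub(LETTER_SLASH_LOOKAROUND, " / ", text)

def space_slashes_groups(text):
    # The letters are consumed and written back through \1 and \2.
    return re.sub(LETTER_SLASH_GROUPS, r"\1 / \2", text)

text = "interesting/practical/useful"
print(space_slashes_lookaround(text))  # interesting / practical / useful
print(space_slashes_groups(text))      # interesting / practical / useful

# Single-letter words expose the difference: the group version consumes "b"
# in the first match, so the second slash has no letter left before it.
print(space_slashes_lookaround("a/b/c"))  # a / b / c
print(space_slashes_groups("a/b/c"))      # a / b/c
```

Slashes that are not surrounded by letters, as in 1/2, are left alone by both versions.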

Why can I not use re.sub to replace a group?

My goal is to find a group in a string using regex and replace it with a space.
The group I am looking to find is a group of symbols only when they fall between strings. When I use re.findall() it works exactly as expected
word = 'This##Is # A # Test#'
print(word)
re.findall(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",word)
>>> ['##', '# ', '# ', '']
But when I use re.sub(), instead of replacing the group, it replaces the entire regex.
x = re.sub(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",r' ',word)
print(x)
>>> ' #'
How can I use regular expressions to replace ONLY the group? The outcome I expect is:
'This Is A Test#'
First, there's no need to escape every "magic" character within a character class, [$#%!\s]* is equally fine and much more readable.
Second, matching (i.e. retrieving) is different from replacing and you could use backreferences to achieve your goal.
Third, if you only want to have # at the end, you could help yourself with a much easier expression:
(?:[\s#](?!\Z))+
Which would then need to be replaced by a space, see a demo on regex101.com.
In Python this could be:
import re
string = "This##Is # A # Test#"
rx = re.compile(r'(?:[\s#](?!\Z))+')
new_string = rx.sub(' ', string)
print(new_string)
# This Is A Test#
You can group the portion of the pattern you want to retain and use a backreference in your replacement string instead (a lookahead keeps the following letter out of the match, so that consecutive separators are all handled):
x = re.sub(r"([a-zA-Z])[$#%!\s]+(?=[a-zA-Z])", r'\1 ', word)
The problem is that your regex matches the wrong thing entirely.
x = re.sub(r'\b[$#%!\s]+\b', ' ', word)
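A short Python 3 sketch may make the difference between findall and sub clearer: findall returns the capturing group, while sub always replaces the whole match. The helper replace_inner_symbols is hypothetical and uses lookarounds as one possible way to touch only the symbols between letters.

```python
import re

word = "This##Is # A # Test#"

# re.sub() replaces the whole match (group 0), not the capturing group,
# so the letters matched around the symbols are replaced as well.
pattern = re.compile(r"[a-zA-Z\s]*([$#%!\s]*)[a-zA-Z]")
print([m.group(0) for m in pattern.finditer(word)])
# ['This##I', 's # A', ' # T', 'est']

def replace_inner_symbols(text):
    # Hypothetical helper: replace runs of symbols/whitespace with one space,
    # but only where a letter stands on both sides of the run.
    return re.sub(r"(?<=[a-zA-Z])[$#%!\s]+(?=[a-zA-Z])", " ", text)

print(replace_inner_symbols(word))  # This Is A Test#
```

Because the lookarounds consume nothing, the trailing # after Test is left in place: it has no letter after it.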

Detect latin characters in regex

I want to apply a regex on a Latin text, and I followed the solution in this question: How to account for accent characters for regex in Python?, where they suggest to add a # character before the regex.
def clean_str(string):
    string = re.sub(r"#(#[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
    string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
    return string.lower().strip()
My problem is that the regex works in detecting the Latin characters, but none of the substitutions are applied to the text.
Example:
If I have a text like "#aaa bbb các. ddd",
it should become "bbb các . ddd", with a space before the dot and with the tag "#aaa" deleted.
But it returns the input text unchanged: "#aaa bbb các. ddd"
Did I miss something?
You have several issues in the current code:
To match any Unicode word char, use \w (rather than [A-Za-z0-9_]) with a Unicode flag
When using re.U with re.sub, remember that the fourth positional argument is count, not flags: either pass the count argument (set it to 0 to replace all occurrences) before the flag, or just use flags=re.U / flags=re.UNICODE
To match any non-word char but a whitespace, you may use [^\w\s]
When you want to replace with a whole match, you do not have to wrap the whole pattern with (...), just make sure you use \g<0> backreference in the replacement pattern.
See an updated method to clean the strings:
>>> def clean_str(s):
...     s = re.sub(r'#\w+', ' ', s, flags=re.U)
...     s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)
...     s = re.sub(r'\s{2,}', ' ', s, flags=re.U)
...     return s.lower().strip()
...
>>> print(clean_str(s))
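For reference, here is a minimal Python 3 sketch of the same function applied to the sample text from the question. It assumes Python 3, where \w is already Unicode-aware for str patterns, so no flag is needed. The last lines illustrate the count pitfall with a made-up string of 40 x characters.

```python
import re

def clean_str(s):
    s = re.sub(r"#\w+", " ", s)            # drop tags such as "#aaa"
    s = re.sub(r"[^\w\s]", r" \g<0>", s)   # put a space before punctuation
    s = re.sub(r"\s{2,}", " ", s)          # collapse runs of whitespace
    return s.lower().strip()

print(clean_str("#aaa bbb các. ddd"))  # bbb các . ddd

# The pitfall from the question: the fourth positional parameter of re.sub()
# is count, so passing re.UNICODE there acts like count=32.
limited = re.sub(r"x", "y", "x" * 40, count=int(re.UNICODE))
print(limited.count("y"))  # 32
```

Passing flags by keyword avoids the mix-up entirely.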

Replace single quotes with double with exclusion of some elements

I want to replace all single quotes in the string with double with the exception of occurrences such as "n't", "'ll", "'m" etc.
input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""
Code 1 (from https://stackoverflow.com/users/918959/antti-haapala):
def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)
There are 3 cases: ' is NOT preceded and NOT followed by an alphanumeric character; or it is not preceded, but is followed, by an alphanumeric character; or it is preceded, but not followed, by an alphanumeric character.
Issue: That doesn't work on words that end in an apostrophe, i.e. most possessive plurals, and it also doesn't work on informal abbreviations that start with an apostrophe.
Code 2 (from https://stackoverflow.com/users/953482/kevin):
def convert_text_func(s):
    c = "_" #placeholder character. Must NOT appear in the string.
    assert c not in s
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    for k,v in protected.iteritems():
        s = s.replace(k,v)
    s = s.replace("'", '"')
    for k,v in protected.iteritems():
        s = s.replace(v,k)
    return s
The set of words is too large to specify by hand; how would one cover cases such as persons', for example?
Please help.
Edit 1:
I am using @anubhava's brilliant answer, but I am facing an issue: sometimes there are words translated from other languages for which the approach fails.
Code=
text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)
Problem:
In the text 'Kumbh melas', melas is a Hindi-to-English translation, not a plural possessive noun.
Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
I am looking to add a condition that somehow fixes this. Manual intervention is the last option.
Edit 2:
Naive and long approach to fix:
def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)
    punctuations = [x for x in string.punctuation]
    for i, word in enumerate(words):
        print i, word
        # i + 1 < len(words) guards the words[i+1] lookup on the last token
        if (i + 1 < len(words) and word not in punctuations and d.check(word) == False and words[i+1] == "'"):
            text = text.replace(words[i] + words[i+1], words[i] + "\"")
    return text
Are there any corner cases I am missing or are there any better approaches?
First attempt
You can also use this regex:
(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))
DEMO IN REGEX101
This regex matches the whole quoted sentence/word together with both quotation marks, and also captures the content of the quotation in group 1, so you can replace the matched part with "\1".
(?<!\w) - negative lookbehind: the quote must not be preceded by a word character. This excludes words like "you'll", etc., but allows the regex to match quotations after characters like \n, :, ;, . or -, etc. Assuming that there will always be whitespace before a quotation is risky.
' - a single quotation mark,
((?:.|\n)+?'?) - capturing group 1: one or more of any character or newline (to match multiline sentences) with a lazy quantifier (to avoid matching from the first to the last single quotation mark), followed by an optional single quotation mark, in case there are two in a row,
'(?!\w) - a single quotation mark not followed by a word character, to exclude text like "i'm", "you're", etc., where the quotation mark is between letters.
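As a rough illustration, the first-attempt pattern can be applied in Python 3 as sketched below. The function name single_to_double is an assumption made for this example, and the sample sentence is the one from the question.

```python
import re

# Pattern from the "first attempt": group 1 holds the quoted content.
QUOTED = re.compile(r"(?<!\w)'((?:.|\n)+?'?)'(?!\w)")

def single_to_double(text):
    # Re-emit the captured content between double quotes.
    return QUOTED.sub(r'"\1"', text)

print(single_to_double("the stackoverflow don't said, 'hey what'"))
# the stackoverflow don't said, "hey what"
```

The apostrophe in don't is skipped because it is preceded by a word character, so the lookbehind rejects it as an opening quote.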
The s' case
However, it still has a problem matching sentences in which an apostrophe occurs after a word ending in s, like: 'the classes' hours'. I think it is impossible to distinguish with a regex whether an s followed by ' should be treated as the end of a quotation or as an s with an apostrophe. But I figured out a limited workaround for this problem, with the regex:
(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))
DEMO IN REGEX101
PYTHON IMPLEMENTATION
with an additional alternative for the s' cases: (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)) where:
(?<!s)'(?!\w) - if there is no s before ', match as in the regex above (first attempt),
(?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before ', end the match on this ' only if there is no other ' followed by a non-word character in the following text, before the end or before another ' (but only a ' preceded by a letter other than s, or the opening of the next quotation). The \w'\w is there to include in such a match any ' that is between letters, like in i'm, etc.
This regex should mismatch only if there are a couple of s' cases in a row. Still, it is far from a perfect solution.
Flaws of \w
Also, when using \w there is always a chance that ' occurs after a symbol, or after a character that is a letter but not in [a-zA-Z_0-9], such as a local-language character, and then it will be treated as the beginning of a quotation. This could be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}) (note that \p{L} is not supported by Python's built-in re module; it needs the third-party regex module), or with something like (?<=^|[,.?!)\s]), etc., a positive lookbehind for the characters which can occur in a sentence before a quotation. However, such a list could be quite long.
You can use:
input="I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print re.sub(r"(?<!s)'(?!(?:t|ll|e?m)\b)", '"', input)
Output:
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
RegEx Demo
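The same substitution as a small Python 3 sketch, wrapped in a function whose name (convert_quotes) is chosen here only for illustration. The second call reproduces the limitation reported in Edit 1 of the question, using the extended pattern quoted there.

```python
import re

def convert_quotes(text, pattern=r"(?<!s)'(?!(?:t|ll|e?m)\b)"):
    # Skip apostrophes preceded by "s" (plural possessives) and those
    # followed by a contraction tail listed in the pattern.
    return re.sub(pattern, '"', text)

text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print(convert_quotes(text))
# I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.

# Known limitation: a closing quote after a word ending in "s" is left untouched.
extended = r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)"
print(convert_quotes("Similar to the 'Kumbh melas', celebrated", extended))
# Similar to the "Kumbh melas', celebrated
```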
Try this: you can use this regex ((?<=\s)'([^']+)'(?=\s)) and replace with "\2"
import re
p = re.compile(ur'((?<=\s)\'([^\']+)\'(?=\s))')
test_str = u"I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
subst = u'"\\2"'  # escape the backslash so that \2 stays a group reference
result = re.sub(p, subst, test_str)
Output
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
Demo
Here is a non-regex way of doing it
text="the stackoverflow don't said, 'hey what'"
out = []
for i, j in enumerate(text):
    if j == '\'':
        # keep the apostrophe in n't, 'll and 'm
        if text[i-1:i+2] == "n't" or text[i:i+3] == "'ll" or text[i:i+2] == "'m":
            out.append(j)
        else:
            out.append('"')
    else:
        out.append(j)
print ''.join(out)
gives as an output
the stackoverflow don't said, "hey what"
Of course, you can improve the exclusion list so that you don't have to check each exclusion manually...
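One way to avoid the hand-written chain of comparisons is to keep the exclusions in a tuple, as in this Python 3 sketch. The tuple contents and the function name are assumptions for illustration; str.startswith accepts a tuple of prefixes and a start index.

```python
# Hypothetical exclusion list; extend it as needed.
CONTRACTION_TAILS = ("'t", "'ll", "'m", "'re", "'ve", "'d", "'s")

def convert_quotes_no_regex(text):
    out = []
    for i, ch in enumerate(text):
        # An apostrophe is kept when it follows a letter and starts one of
        # the listed tails; every other single quote becomes a double quote.
        keep = (
            ch == "'"
            and i > 0
            and text[i - 1].isalpha()
            and text.startswith(CONTRACTION_TAILS, i)
        )
        out.append(ch if ch != "'" or keep else '"')
    return "".join(out)

print(convert_quotes_no_regex("the stackoverflow don't said, 'hey what'"))
# the stackoverflow don't said, "hey what"
```

Like the original loop, this still converts the apostrophe of plural possessives such as persons', so it does not solve that case.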
Here is another possible way of doing it:
import re
text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print re.sub("((?<!s)'(?!\w+)|(\s+'))", '"', text)
I have tried to avoid the need for special cases, it gives:
I'm one of the persons' stackoverflow don't th'em said,"hey what" I'll handle it.

Python (2.7) - Replacing multiple patterns in a string using re

I am trying to think of a more elegant way of replacing multiple patterns in a given string using re in relation to a little problem, which is to remove from a given string all substrings consisting of more than two spaces and also all substrings where a letter starts after a period without any space. So the sentence
'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
should be corrected to:
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
My solution, below, seems a bit messy. I was wondering whether there was a nicer way of doing this, as in a one-liner regex.
def correct( astring ):
    import re
    bstring = re.sub( r' +', ' ', astring )
    letters = [frag.strip( '.' ) for frag in re.findall( r'\.\w', bstring )]
    for letter in letters:
        bstring = re.sub( r'\.{}'.format( letter ), '. {}'.format( letter ), bstring )
    return bstring
s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
print(re.sub("\s+"," ",s).replace(".",". ").rstrip())
This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.
You could use re.sub function like below. This would add exactly two spaces next to the dot except the last dot and it also replaces one or more spaces except the one after dot with a single space.
>>> s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
>>> re.sub(r'(?<!\.)\s+', ' ' ,re.sub(r'\.\s*(?!$)', r'. ', s))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
OR
>>> re.sub(r'\.\s*(?!$)', r'. ', re.sub(r'\s+', ' ', s))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
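The second variant can be wrapped in a function as sketched below for Python 3. The sample string with runs of extra spaces is made up here, since the original spacing of the question's sentence is not visible; stripping the text first is an added assumption so that trailing whitespace does not defeat the end-of-string check.

```python
import re

def correct(text):
    # 1) collapse every run of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()
    # 2) make each period that is not at the end be followed by one space
    return re.sub(r"\.\s*(?!$)", ". ", text)

s = "This  is a strange   sentence.  There are too many   spaces.And.Some periods are not.  placed properly."
print(correct(s))
# This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.
```

Running the whitespace pass first means the period pass only ever sees zero or one space after a period.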
An approach without using any RegEX
>>> ' '.join(s.split()).replace('.','. ')[:-1]
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
What pure regex? Like this?
>>> import re
>>> s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
>>> re.sub('\s+$', '', re.sub('\s+', ' ', re.sub('\.', '. ', s)))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
