Remove words with spaces or "-" in them Python

Remove words with spaces or "-" in them Python - python

This is an extension to the question here
Now as in the linked question, the answer used a space? as a regex pattern to match a string with a space or no space in it.
The Problem Statement:
I have a string and an array of phrases.
input_string = 'alice is a character from a fairy tale that lived in a wonder land. A character about whome no-one knows much about'
phrases_to_remove = ['wonderland', 'character', 'noone']
Now what I want to do is to remove the last occurrences of the words in the array phrases_to_remove from the input_string.
output_string = 'alice is a character from a fairy tale that lived in a. A about whome knows much about'
Notice: the words to remove may or may not occur in the string and if they do, they may occur in either the same form {'wonderland' or 'character', 'noone'} or they may occur with a space or a hyphen (-) in between the words e.g. wonder land, no-one, character.
The issue with the code is, I can't remove the words that have a space or a - mismatch. For example wonder land and wonderland and wonder-land.
I tried the (-)?|( )? as a regex but couldn't get it to work.
I need help

The problem with your regex is grouping. Using (-)?|( )? as a separator does not do what you think it does.
Consider what happens when the list of words is a,b:
>>> regex = "(-)?|( )?".join(["a", "b"])
>>> regex
'a(-)?|( )?b'
You'd like this regex to match ab or a b or a-b, but clearly it does not do that. It matches a, a-, b or <space>b instead!
>>> re.match(regex, 'a')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, 'a-')
<_sre.SRE_Match object at 0x7f68c9f3b718>
>>> re.match(regex, 'b')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, ' b')
<_sre.SRE_Match object at 0x7f68c9f3b718>
To fix this you can enclose the separator in its own group: ([- ])?.
If you also want to match words like wonder - land (i.e. where there are spaces before/after the hyphen) you should use the following (\s*-?\s*)?.

Since you don't know where the separations can be, you could generate a regex made of ORed regexes (using word boundaries to avoid matching sub-words).
Those regexes would alternate the letters of the word and [\s\-]* (matching zero to several occurrences of "space" or "dash") using str.join on each character
import re
input_string = 'alice is a character from a fairy tale that lived in a wonder - land. A character about whome no one knows much about'
phrases_to_remove = ['wonderland', 'character', 'noone']
the_regex = "|".join(r"\b{}\b".format('[\s\-]*'.join(x)) for x in phrases_to_remove)
Now to handle the "replace everything but the first occurrence" part: let's define an object which will replace everything but the first match (using an internal counter)
class Replacer:
def __init__(self):
self.__counter = 0
def replace(self,m):
if self.__counter:
return ""
else:
self.__counter += 1
return m.group(0)
now pass the replace method to re.sub:
print(re.sub(the_regex,Replacer().replace,input_string))
result:
alice is a character from a fairy tale that lived in a . A about whome knows much about
(the generated regex is pretty complex BTW: \bw[\s\-]*o[\s\-]*n[\s\-]*d[\s\-]*e[\s\-]*r[\s\-]*l[\s\-]*a[\s\-]*n[\s\-]*d\b|\bc[\s\-]*h[\s\-]*a[\s\-]*r[\s\-]*a[\s\-]*c[\s\-]*t[\s\-]*e[\s\-]*r\b|\bn[\s\-]*o[\s\-]*o[\s\-]*n[\s\-]*e\b)

You can use one at a time:
For space:
For '-' :
^[ \t]+
#"[^0-9a-zA-Z]+

Related

python findall regex expression

I got a long string and i need to find words which contain the character 'd' and afterwards the character 'e'.
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
b=' '.join(l)
runs1=re.findall(r"\b\w?d.*e\w?\b",b)
print(runs1)
\b is the boundary of the word, which follows with any char (\w?) and etc.
I get an empty list.

You can massively simplify your solution by applying a regex based search on each string individually.
>>> p = re.compile('d.*e')
>>> list(filter(p.search, l))
Or,
>>> [x for x in l if p.search(x)]
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Why didn't re.findall work? You were searching one large string, and your greedy match in the middle was searching across strings. The fix would've been
>>> re.findall(r"\b\S*d\S*e\S*", ' '.join(l))
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Using \S to match anything that is not a space.

You can filter the result :
import re
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
pattern = r'd.*?e'
print(list(filter(lambda x:re.search(pattern,x),l)))
output:
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']

Something like this maybe
\b\w*d\w*e\w*
Note that you can probably remove the word boundary here because
the first \w guarantees a word boundary before.
The same \w*d\w*e\w*

Replace single quotes with double with exclusion of some elements

I want to replace all single quotes in the string with double with the exception of occurrences such as "n't", "'ll", "'m" etc.
input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""
Code 1:(#https://stackoverflow.com/users/918959/antti-haapala)
def convert_regex(text):
return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)
There are 3 cases: ' is NOT preceded and is NOT followed by a alphanumeric character; or is not preceded, but followed by an alphanumeric character; or is preceded and not followed by an alphanumeric character.
Issue: That doesn't work on words that end in an apostrophe, i.e.
most possessive plurals, and it also doesn't work on informal
abbreviations that start with an apostrophe.
Code 2:(#https://stackoverflow.com/users/953482/kevin)
def convert_text_func(s):
c = "_" #placeholder character. Must NOT appear in the string.
assert c not in s
protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
for k,v in protected.iteritems():
s = s.replace(k,v)
s = s.replace("'", '"')
for k,v in protected.iteritems():
s = s.replace(v,k)
return s
Too large set of words to specify, as how can one specify persons' etc.
Please help.
Edit 1:
I am using #anubhava's brillant answer. I am facing this issue. Sometimes, there language translations which the approach fail.
Code=
text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)
Problem:
In text, 'Kumbh melas' melas is a Hindi to English translation not plural possessive nouns.
Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
I am looking maybe to add a condition that somehow fixes it. Human-level intervention is the last option.
Edit 2:
Naive and long approach to fix:
def replace_translations(text):
d = enchant.Dict("en_US")
words=tokenize_words(text)
punctuations=[x for x in string.punctuation]
for i,word in enumerate(words):
print i,word
if(i!=len(words) and word not in punctuations and d.check(word)==False and words[i+1]=="'"):
text=text.replace(words[i]+words[i+1],words[i]+"\"")
return text
Are there any corner cases I am missing or are there any better approaches?

First attempt
You can also use this regex:
(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))
DEMO IN REGEX101
This regex match whole sentence/word with both quoting marks, from beginning and end, but also campure the content of quotation inside group nr 1, so you can replace matched part with "\1".
(?<!\w) - negative lookbehind for non-word character, to exclude words like: "you'll", etc., but to allow the regex to match quatations after characters like \n,:,;,. or -,etc. The assumption that there will always be a whitespace before quotation is risky.
' - single quoting mark,
(?:.|\n)+?'?) - non capturing group: one or more of any character or
new line (to match multiline sentences) with lazy quantifire (to avoid
matching from first to last single quoting mark), followed by
optional single quoting sing, if there would be two in a row
'(?!\w) - single quotes, followed by non-word character, to exclude
text like "i'm", "you're" etc. where quoting mark is beetwen words,
The s' case
However it still has problem with matching sentences with apostrophes occurs after word ending with s, like: 'the classes' hours'. I think it is impossible to distinguish with regex when s followed by ' should be treated as end of quotation, or as or s with apostrophes. But I figured out a kind of limited work around for this problem, with regex:
(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))
DEMO IN REGEX101
PYTHON IMPLEMENTATION
with additional alternative for cases with s': (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w) where:
(?<!s)'(?!\w) - if there is no s before ', match as regex above (first attempt),
(?<=s)'(?!([^']|\w'\w)+'(?!\w) - if there is s before ', end a match on this ' only if there is no other ' followed by non-word
character in following text, before end or before another ' (but only ' preceded by letter other than s, or opening of next quotaion). The \w'\w is to include in such match a ' wich are between letters, like in i'm, etc.
this regex should match wrong only it there is couple s' cases in a row. Still, it is far from perfect solution.
Flaws of \w
Also, using \w there is always chance that ' would occur after sybol or non-[a-zA-Z_0-9] but still letter character, like some local language character, and then it will be treated as beginning of a quatation. It could be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}) or something like (?<=^|[,.?!)\s]), etc., positive lookaround for characters wich can occour in sentence before quatation. However a list could be quite long.

You can use:
input="I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print re.sub(r"(?<!s)'(?!(?:t|ll|e?m)\b)", '"', input)
Output:
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
RegEx Demo

Try this: you can use this regex ((?<=\s)'([^']+)'(?=\s)) and replace with "\2"
import re
p = re.compile(ur'((?<=\s)\'([^\']+)\'(?=\s))')
test_str = u"I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
subst = u"\"\2\""
result = re.sub(p, subst, test_str)
Output
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
Demo

Here is a non-regex way of doing it
text="the stackoverflow don't said, 'hey what'"
out = []
for i, j in enumerate(text):
if j == '\'':
if text[i-1:i+2] == "n't" or text[i:i+3] == "'ll" or text[i:i+3] == "'m":
out.append(j)
else:
out.append('"')
else:
out.append(j)
print ''.join(out)
gives as an output
the stackoverflow don't said, "hey what"
Of course, you can improve the exclusion list to not have to use manually check each exclusion...

Here is another possible way of doing it:
import re
text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print re.sub("((?<!s)'(?!\w+)|(\s+'))", '"', text)
I have tried to avoid the need for special cases, it gives:
I'm one of the persons' stackoverflow don't th'em said,"hey what" I'll handle it.

How to list all Unicode strings starting and ending with a particular character?

I am trying to list all the words have particular ending and starting.
This is ണ് my ending character and വി is my starting character.
this is my input
പാമോലിന്‍ കേസിന്റെ വിചാരണ നടപടികള്‍ ഹൈക്കോടതി രണ്ടുമാസത്തേക്ക് സ്‌റ്റേചെയ്തു. കേസ് പിന്‍വലിക്കണമെന്ന ആവശ്യം നിരസിച്ച തൃശ്ശൂര്‍ വിജിലന്‍സ് കോടതി ഉത്തരവിനെതിരെ വിജിലന്‍സ് സമര്‍പ്പിച്ച ഹര്‍ജിയിലാണ് ഇടക്കാല ഉത്തരവ്.
The expected output is
വിചാരണ
How can I write regular expression for it ?
re.findall(ur'\bവി\w+ണ\b', inputtext, flags=re.UNICODE) won´t work
I still don´t understand why its not working like English, Please add this fact to the answer so that I could get better understanding of the problem

Your input text is full of a mix of word and non-word characters, so the only way to determine a word boundary is to look behind and ahead for spaces:
re.findall(ur'(?<![^ ])വി[^ ]+ണ്?(?![^ ])', inputtext, flags=re.UNICODE)
where inputtext is a Unicode value. The (?<!...) and (?!...) are negative lookbehind and lookahead assertions; the match locations in the text that are not preceded or followed by a non-space character, respectively.
Within your boundary text, we match non-spaces as well.
This matches your expected input:
>>> print re.findall(ur'(?<![^ ])വി[^ ]+ണ്?(?![^ ])', inputtext, flags=re.UNICODE)[0]
വിചാരണ

... or, if you want something more verbal
original_list = ('abc', 'ccbd', 'abbc')
filtered = tuple(filter(lambda x: x.startswith('a') and x.endswith('c'), original_list))
filtered
('abc', 'abbc')
but it definately doesn't answer your question.

How to match one character word?

How do I match only words of character length one? Or do I have to check the length of the match after I performed the match operation? My filter looks like this:
sw = r'\w+,\s+([A-Za-z]){1}
So it should match
rs =re.match(sw,'Herb, A')
But shouldn't match
rs =re.match(sw,'Herb, Abc')

If you use \b\w\b you will only match one character of type word. So your expression would be
sw = r'\w+,\s+\w\b'
(since \w is preceded by at least one \s you don't need the first \b)
Verification:
>>> sw = r'\w+,\s+\w\b'
>>> print re.match(sw,'Herb, A')
<_sre.SRE_Match object at 0xb7242058>
>>> print re.match(sw,'Herb, Abc')
None

You can use
(?<=\s|^)\p{L}(?=[\s,.!?]|$)
which will match a single letter that is preceded and followed either by a whitespace character or the end of the string. The lookahead is a little augmented by punctuation marks as well ... this all depends a bit on your input data. You could also do a lookahead on a non-letter, but that begs the question whether “a123” is really a one-letter word. Or “I'm”.

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I'm tried out of various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I'm tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alpha alphanumerics greedily followed by either one or no white-space character and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong but I would like to know why. (I expected this to return the entire sentence as the result)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation(python help('re')) says that the ,+,? Match x or x (greedy) repetitions of the preceding RE.
In such a case is simply space the preceding RE for '?' or is '\w+ ' the preceding RE? And what will be the RE for the '' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.

Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use re.group(0) to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'

Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)

Why do you want to limit the number of white space character in continuation? Because a sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row, but rather a sentence is the area of text that ends with a punctuation mark or rather something that is not in the above sequence including white space.
([a-zA-Z0-9\s])*
The above regex will match a sentence wherein it is a series or spaces in series zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with a alphanumeric character.
Hope this is what you were looking for.

Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| +)')
def main():
i = 0
for s in re_sentence.finditer(source):
print "%d:%s" % (i, s.group(0))
i += 1
if __name__ == '__main__':
main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Remove words with spaces or "-" in them Python - python

You can use one at a time: For space: For '-' : ^[ \t]+ #"[^0-9a-zA-Z]+

Related

python findall regex expression

Replace single quotes with double with exclusion of some elements

How to list all Unicode strings starting and ending with a particular character?

How to match one character word?

Confusing Behaviour of regex in Python

Categories

Resources