I want to replace all single quotes in a string with double quotes, with the exception of occurrences such as "n't", "'ll", "'m", etc.
input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""
Code 1 (https://stackoverflow.com/users/918959/antti-haapala):
import re

def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)
There are 3 cases: ' is NOT preceded and is NOT followed by an alphanumeric character; or it is not preceded, but is followed, by an alphanumeric character; or it is preceded, but not followed, by an alphanumeric character.
Issue: That doesn't work on words that end in an apostrophe, i.e.
most possessive plurals, and it also doesn't work on informal
abbreviations that start with an apostrophe.
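To make the issue concrete, here is a small sketch that runs Code 1 on a sentence containing a plural possessive. The sample sentence is made up for illustration and is not from the original post.

```python
import re

def convert_regex(text):
    # Code 1 from above: replace every ' that is not between two word characters
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

# Hypothetical sample: persons' is a plural possessive, don't is a contraction
sample = "the persons' answers don't say, 'hey what'"
result = convert_regex(sample)
print(result)
```

The contraction survives, but the apostrophe of persons' is converted as if it closed a quotation.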
Code 2 (https://stackoverflow.com/users/953482/kevin):
def convert_text_func(s):
    c = "_"  # placeholder character; must NOT appear in the string
    assert c not in s
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    # hide the apostrophes of the protected words behind the placeholder
    for k, v in protected.items():
        s = s.replace(k, v)
    s = s.replace("'", '"')
    # restore the protected words
    for k, v in protected.items():
        s = s.replace(v, k)
    return s
Issue: the set of words to protect is too large to specify. For example, how can one list every plural possessive such as persons'?
Please help.
Edit 1:
I am using @anubhava's brilliant answer, but I am facing this issue: sometimes there are translated words for which the approach fails.
Code=
text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)
Problem:
In the text 'Kumbh melas', melas is a Hindi-to-English translation, not a plural possessive noun.
Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
I am looking to add a condition that somehow fixes this. Human-level intervention is the last option.
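The failure can be reproduced with the snippet below. It also includes one possible check, offered only as an assumption and not as a fix: an odd number of double quotes after conversion suggests that a closing quote was skipped by the (?<!s) lookbehind. The helper names are hypothetical.

```python
import re

PATTERN = r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)"

def convert(text):
    # Same substitution as in Edit 1
    return re.sub(PATTERN, '"', text)

def looks_unbalanced(text):
    # Heuristic (assumption): quotations come in pairs, so an odd count is suspicious
    return text.count('"') % 2 == 1

text = "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
converted = convert(text)
print(converted)
print(looks_unbalanced(converted))
```

Such a check only flags the sentence for a second pass; it does not decide which apostrophe should have been converted.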
Edit 2:
Naive and long approach to fix:
import string
import enchant  # pyenchant

def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # my own tokenizer helper, not shown here
    punctuations = list(string.punctuation)
    for i, word in enumerate(words):
        # a non-dictionary word directly followed by ' is treated as the end of a quotation
        if (i < len(words) - 1 and word not in punctuations
                and not d.check(word) and words[i + 1] == "'"):
            text = text.replace(word + words[i + 1], word + '"')
    return text
Are there any corner cases I am missing or are there any better approaches?
First attempt
You can also use this regex:
(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))
This regex matches the whole quoted sentence or word, including both quotation marks at the beginning and the end, and also captures the quoted content in group 1, so you can replace the matched part with "\1".
(?<!\w) - negative lookbehind for a word character, to exclude words like "you'll", etc., but to allow the regex to match quotations after characters like \n, :, ;, . or -, etc. The assumption that there will always be whitespace before a quotation is risky.
' - single quotation mark,
((?:.|\n)+?'?) - capturing group 1: one or more of any character or
newline (to match multiline sentences) with a lazy quantifier (to avoid
matching from the first to the last single quotation mark), followed by
an optional single quotation mark, in case there are two in a row,
'(?!\w) - single quotation mark not followed by a word character, to exclude
text like "i'm", "you're", etc., where the quotation mark is between letters.
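As a rough sketch, the first-attempt regex can be applied in Python with re.sub and the replacement "\1". The function name and the second sample sentence are illustrative assumptions.

```python
import re

# First-attempt regex from above; group 1 holds the quoted content
QUOTED = re.compile(r"(?:(?<!\w)'((?:.|\n)+?'?)'(?!\w))")

def convert_quotes(text):
    # Re-wrap the captured content in double quotes
    return QUOTED.sub(r'"\1"', text)

print(convert_quotes("the stackoverflow don't said, 'hey what'"))
print(convert_quotes("'I'm here' he said"))
```

The apostrophes inside don't and I'm are left alone because they sit between word characters.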
The s' case
However, it still has a problem with sentences in which an apostrophe occurs after a word ending in s, like: 'the classes' hours'. I think it is impossible to distinguish with a regex whether an s followed by ' should be treated as the end of a quotation or as an s with an apostrophe. But I figured out a limited workaround for this problem, with this regex:
(?:(?<!\w)'((?:.|\n)+?'?)(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))
with an additional alternative for the s' cases: (?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w)) where:
(?<!s)'(?!\w) - if there is no s before ', match as in the regex above (first attempt),
(?<=s)'(?!([^']|\w'\w)+'(?!\w)) - if there is an s before ', end the match on this ' only if there is no other ' followed by a non-word
character in the following text, before the end or before another ' (but only a ' preceded by a letter other than s, or the opening of the next quotation). The \w'\w is there to include in such a match a ' which is between letters, like in i'm, etc.
This regex should only match wrongly if there are a couple of s' cases in a row. Still, it is far from a perfect solution.
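Here is a hedged sketch of how the workaround regex could be used from Python. The function name and the longer sample sentence are assumptions made for illustration.

```python
import re

# Workaround regex from above, with the extra alternative for s' cases
QUOTED_S = re.compile(
    r"(?:(?<!\w)'((?:.|\n)+?'?)"
    r"(?:(?<!s)'(?!\w)|(?<=s)'(?!([^']|\w'\w)+'(?!\w))))"
)

def convert_quotes_s(text):
    # Group 1 is the quoted content; the closing quote is outside the group
    return QUOTED_S.sub(r'"\1"', text)

print(convert_quotes_s("'the classes' hours'"))
print(convert_quotes_s("he said 'the classes' hours' today"))
```

In both samples the apostrophe after classes stays, because another closing quote is found later in the text.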
Flaws of \w
Also, when using \w there is always a chance that ' occurs after a symbol, or after a character that is a letter but not in [a-zA-Z_0-9], such as a local-language character, and then it will be treated as the beginning of a quotation. This could be avoided by replacing (?<!\w) and (?!\w) with (?<!\p{L}) and (?!\p{L}) (in Python, \p{L} needs the third-party regex module, since the built-in re does not support it), or with something like (?<=^|[,.?!)\s]), etc., i.e. a positive lookaround for the characters which can occur in a sentence before a quotation. However, such a list could be quite long.
You can use:
import re

text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print(re.sub(r"(?<!s)'(?!(?:t|ll|e?m)\b)", '"', text))
Output:
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
Try this: you can use the regex ((?<=\s)'([^']+)'(?=\s)) and replace with "\2":
import re
p = re.compile(r"((?<=\s)'([^']+)'(?=\s))")
test_str = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
subst = r'"\2"'  # raw string, so that \2 is a backreference and not the character \x02
result = p.sub(subst, test_str)
Output
I'm one of the persons' stackoverflow don't th'em said, "hey what" I'll handle it.
Here is a non-regex way of doing it
text = "the stackoverflow don't said, 'hey what'"
out = []
for i, j in enumerate(text):
    if j == '\'':
        # keep the apostrophe when it belongs to n't, 'll or 'm
        if text[i-1:i+2] == "n't" or text[i:i+3] == "'ll" or text[i:i+2] == "'m":
            out.append(j)
        else:
            out.append('"')
    else:
        out.append(j)
print(''.join(out))
gives as an output
the stackoverflow don't said, "hey what"
Of course, you can improve the exclusion list so that you do not have to check each exclusion manually...
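One way the exclusion list could be generalised is sketched below. The suffix tuple and the helper names are assumptions for illustration; the list would need extending for real text.

```python
# Assumed list of contraction endings; extend as needed
CONTRACTION_SUFFIXES = ("t", "ll", "m", "s", "d", "ve", "re")

def is_contraction_apostrophe(text, i):
    """Return True if the ' at index i sits inside a word such as don't or I'll."""
    if i == 0 or not text[i - 1].isalpha():
        return False
    rest = text[i + 1:]
    for suffix in CONTRACTION_SUFFIXES:
        following = rest[len(suffix):len(suffix) + 1]
        # the suffix must end the word, so the next character may not be a letter
        if rest.startswith(suffix) and not following.isalpha():
            return True
    return False

def convert_quotes(text):
    out = []
    for i, ch in enumerate(text):
        if ch == "'" and not is_contraction_apostrophe(text, i):
            out.append('"')
        else:
            out.append(ch)
    return "".join(out)

print(convert_quotes("the stackoverflow don't said, 'hey what'"))
```

Like the loop above, this still cannot tell a plural possessive from a closing quote.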
Here is another possible way of doing it:
import re
text = "I'm one of the persons' stackoverflow don't th'em said, 'hey what' I'll handle it."
print(re.sub(r"((?<!s)'(?!\w+)|(\s+'))", '"', text))
I have tried to avoid the need for special cases. Note that \s+' also consumes the whitespace before an opening quote. It gives:
I'm one of the persons' stackoverflow don't th'em said,"hey what" I'll handle it.
Related
I want to split '10.1 This is a sentence. Another sentence.'
as ['10.1 This is a sentence', 'Another sentence'] and split '10.1. This is a sentence. Another sentence.' as ['10.1. This is a sentence', 'Another sentence']
I have tried
s.split(r'\D.\D')
It doesn't work, how can this be solved?
If you plan to split a string on a . char that is neither preceded nor followed by a digit, and that is not at the end of the string, a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
See the regex demo.
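A minimal sketch of the splitting approach applied to both sample strings follows. The wrapper function and the clean-up step (stripping whitespace and the final period) are additions for illustration, not part of the regex itself.

```python
import re

def split_sentences(text):
    # Split on a . that is neither next to a digit nor at the end of the string
    parts = re.split(r'(?<!\d)\.(?!\d|$)', text)
    # Assumed clean-up: drop surrounding whitespace and the last sentence's period
    return [p.strip().rstrip('.') for p in parts if p.strip()]

print(split_sentences('10.1 This is a sentence. Another sentence.'))
print(split_sentences('10.1. This is a sentence. Another sentence.'))
```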
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
See this regex demo. Details:
(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
\d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
| - or
[^.] - any char other than a . char.
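The extracting pattern described above can be tried out as follows. The wrapper function and the whitespace stripping are assumptions added for readability of the result.

```python
import re

SENTENCE = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')

def extract_sentences(text):
    # findall returns whole matches because the pattern has no capturing groups
    return [m.strip() for m in SENTENCE.findall(text) if m.strip()]

print(extract_sentences('10.1 This is a sentence. Another sentence.'))
print(extract_sentences('10.1. This is a sentence. Another sentence.'))
```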
All sentences (except the very last one) end with a period followed by space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case '. ' is that situation.
import re
doc = '10.1 This is a sentence. Another sentence.'

def sentences(doc):
    # split all sentences
    s = re.split(r'\.\s+', doc)
    # remove empty index or remove period from absolute last index, if present
    if s[-1] == '':
        s = s[0:-1]
    elif s[-1].endswith('.'):
        s[-1] = s[-1][:-1]
    # return sentences
    return s

print(sentences(doc))
The way I structured my regex it should also eliminate arbitrary whitespace between paragraphs.
You have multiple issues:
You're not using re.split(), you're using str.split().
You haven't escaped the ., use \. instead.
You're not using lookahead and lookbehinds so your 3 characters are gone.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
Basically, (?<=\D\.) finds a position right after a . that is preceded by a non-digit character. (?=\D) then makes sure there's a non-digit after the current position. When both conditions hold, it splits correctly.
I am trying to split/parse comments which contain strings, numbers and emojis, and I want to write generic code that can split a comment into parts depending on where the emojis are.
For example:
comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"
The output should be something like:
output_1 = ["This is", "my comment"]
output_2 = ["Another comment to", "parse"]
I have been thinking that I could do a parsing with special characters only, but maybe it will leave the "O" in ":O", or the "v" in ":v"
You may try matching on the pattern (?<!\S)\w+\S?(?: \w+\S?)*, which attempts to find any sequence of all word terms, which may end in an optional non whitespace character (such as a punctuation character).
import re

inp = ["This is :) my comment :O", ">:O Another comment to :v parse"]
for i in inp:
    matches = re.findall(r'(?<!\S)\w+\S?(?: \w+\S?)*', i)
    print(matches)
This prints:
['This is', 'my comment']
['Another comment to', 'parse']
Here is an explanation of the regex pattern being used:
(?<!\S)        assert that what precedes the word is either whitespace or the start of the string
\w+            match a word
\S?            followed by zero or one non-whitespace character (such as punctuation symbols)
(?: \w+\S?)*   zero or more word/symbol sequences following
When splitting, re.split is often useful. I would do:
import re
comment_1 = "This is :) my comment :O"
comment_2 = ">:O Another comment to :v parse"
output_1 = re.split(r'\s*\S?:\S+\s*', comment_1)
output_2 = re.split(r'\s*\S?:\S+\s*', comment_2)
print(output_1)
print(output_2)
output
['This is', 'my comment', '']
['', 'Another comment to', 'parse']
Note that this differs from your required output, as there are empty strs in the outputs, but these can easily be removed using a list comprehension, e.g. [i for i in output_1 if i]. r'\s*\S?:\S+\s*' might be explained as zero or one non-whitespace character (\S?) followed by a colon (:) and one or more non-whitespace characters (\S+), with leading and trailing whitespace included if present (\s*).
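Putting the split and the list comprehension together might look like the sketch below; the function name is a hypothetical choice.

```python
import re

EMOTICON = re.compile(r'\s*\S?:\S+\s*')

def split_on_emoticons(comment):
    # Split on emoticons and drop the empty strings left at the edges
    return [part for part in EMOTICON.split(comment) if part]

print(split_on_emoticons("This is :) my comment :O"))
print(split_on_emoticons(">:O Another comment to :v parse"))
```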
This is an extension to the question here
Now as in the linked question, the answer used a space? as a regex pattern to match a string with a space or no space in it.
The Problem Statement:
I have a string and an array of phrases.
input_string = 'alice is a character from a fairy tale that lived in a wonder land. A character about whome no-one knows much about'
phrases_to_remove = ['wonderland', 'character', 'noone']
Now what I want to do is to remove the last occurrences of the words in the array phrases_to_remove from the input_string.
output_string = 'alice is a character from a fairy tale that lived in a. A about whome knows much about'
Notice: the words to remove may or may not occur in the string and if they do, they may occur in either the same form {'wonderland' or 'character', 'noone'} or they may occur with a space or a hyphen (-) in between the words e.g. wonder land, no-one, character.
The issue with the code is, I can't remove the words that have a space or a - mismatch. For example wonder land and wonderland and wonder-land.
I tried the (-)?|( )? as a regex but couldn't get it to work.
I need help
The problem with your regex is grouping. Using (-)?|( )? as a separator does not do what you think it does.
Consider what happens when the list of words is a,b:
>>> regex = "(-)?|( )?".join(["a", "b"])
>>> regex
'a(-)?|( )?b'
You'd like this regex to match ab or a b or a-b, but clearly it does not do that. It matches a, a-, b or <space>b instead!
>>> re.match(regex, 'a')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, 'a-')
<_sre.SRE_Match object at 0x7f68c9f3b718>
>>> re.match(regex, 'b')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, ' b')
<_sre.SRE_Match object at 0x7f68c9f3b718>
To fix this you can enclose the separator in its own group: ([- ])?.
If you also want to match words like wonder - land (i.e. where there are spaces before/after the hyphen) you should use the following (\s*-?\s*)?.
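A small sketch of the grouped separator follows, assuming the split point of the phrase is known in advance. The build_pattern helper is hypothetical, and it uses a non-capturing group since the separator itself does not need to be captured.

```python
import re

def build_pattern(parts, sep=r"(?:\s*-?\s*)"):
    # Join the parts with an optional separator: spaces, a hyphen, or both
    return re.compile(sep.join(re.escape(p) for p in parts))

pattern = build_pattern(["wonder", "land"])
for candidate in ["wonderland", "wonder land", "wonder-land", "wonder - land", "wonder_land"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

Only the last candidate, with an underscore, is rejected.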
Since you don't know where the separations can be, you could generate a regex made of ORed regexes (using word boundaries to avoid matching sub-words).
Those regexes would alternate the letters of the word and [\s\-]* (matching zero to several occurrences of "space" or "dash") using str.join on each character
import re
input_string = 'alice is a character from a fairy tale that lived in a wonder - land. A character about whome no one knows much about'
phrases_to_remove = ['wonderland', 'character', 'noone']
the_regex = "|".join(r"\b{}\b".format(r'[\s\-]*'.join(x)) for x in phrases_to_remove)
Now to handle the "replace everything but the first occurrence" part: let's define an object which will replace everything but the first match (using an internal counter)
class Replacer:
    def __init__(self):
        self.__counter = 0

    def replace(self, m):
        if self.__counter:
            return ""
        else:
            self.__counter += 1
            return m.group(0)
now pass the replace method to re.sub:
print(re.sub(the_regex,Replacer().replace,input_string))
result:
alice is a character from a fairy tale that lived in a . A about whome knows much about
(the generated regex is pretty complex BTW: \bw[\s\-]*o[\s\-]*n[\s\-]*d[\s\-]*e[\s\-]*r[\s\-]*l[\s\-]*a[\s\-]*n[\s\-]*d\b|\bc[\s\-]*h[\s\-]*a[\s\-]*r[\s\-]*a[\s\-]*c[\s\-]*t[\s\-]*e[\s\-]*r\b|\bn[\s\-]*o[\s\-]*o[\s\-]*n[\s\-]*e\b)
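The Replacer above keeps only the very first match overall. If the goal is read literally from the question, i.e. removing the last occurrence of each phrase, a different sketch is possible. The function names and the final space-collapsing step are assumptions, not part of the answer above.

```python
import re

def phrase_pattern(phrase):
    # Allow optional spaces or hyphens between any two letters of the phrase
    return r"\b" + r"[\s\-]*".join(re.escape(ch) for ch in phrase) + r"\b"

def remove_last_occurrences(text, phrases):
    for phrase in phrases:
        matches = list(re.finditer(phrase_pattern(phrase), text))
        if matches:
            last = matches[-1]
            text = text[:last.start()] + text[last.end():]
    # Collapse the doubled spaces left behind by the removals
    return re.sub(r" {2,}", " ", text)

input_string = ('alice is a character from a fairy tale that lived in a wonder land. '
                'A character about whome no-one knows much about')
phrases_to_remove = ['wonderland', 'character', 'noone']
output_string = remove_last_occurrences(input_string, phrases_to_remove)
print(output_string)
```

Phrases that do not occur in the string are simply skipped.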
You can use one at a time:
For space: ^[ \t]+
For '-': [^0-9a-zA-Z]+
I'm trying to find an expression "K others" in a sentence "Chris and 34K others"
I tried with regular expression, but it doesn't work :(
import re
value = "Chris and 34K others"
m = re.search("(.K.others.)", value)
if m:
    print("it is true")
else:
    print("it is not")
Guessing that you're web-page scraping "you and 34k others liked this on Facebook", and you're wrapping "K others" in a capture group, I'll jump straight to how to get the number:
import re
value = "Chris and 34K others blah blah"
# regex describes
# a leading space, one or more word characters, commas or dots (the number),
# an optional space, and a trailing 'K others' in any capitalisation
m = re.search(r"\s([\w,.]+?)\s*K others", value, re.IGNORECASE)
if m:
    captured_values = m.groups()
    print("Number of others:", captured_values[0], "K")
else:
    print("it is not")
This should also cover uppercase/lowercase K, numbers with commas (1,100K people), spaces between the number and the K, and work if there's text after 'others' or if there isn't.
You should use search rather than match unless you expect your regular expression to match at the beginning. The help string for re.match mentions that the pattern is applied at the start of the string.
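The difference can be seen in a short sketch using the string from the question; the variable names are illustrative.

```python
import re

value = "Chris and 34K others"

# re.match only tries the pattern at the start of the string
at_start = re.match(r"K\sothers", value)
# re.search scans the string for the first position where the pattern matches
anywhere = re.search(r"K\sothers", value)

print(at_start)
print(anywhere.group(0))
```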
If you want to match something within the string, use re.search; re.match only matches at the beginning. Also, change your RegEx to (K.others): the last . ruins the RegEx, as there is nothing after others for it to match, and the first . matches any character before K. I removed those:
>>> bool(re.search("(K.others)", "Chris and 34K others"))
True
The RegEx (K.others) matches:
Chris and 34K others
            ^^^^^^^^
As opposed to (.K.others.), which matches nothing. You can use (.K.others) as well, which also matches the character before:
Chris and 34K others
           ^^^^^^^^^
Also, you can use \s instead of . to match only whitespace characters: (K\sothers). This will literally match K, a whitespace character, and others.
Now, if you want to match everything preceding and everything following, try: (.+)?(K\sothers)(\s.+)?. You can get the number with this.
I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I've tried out various combinations of regular expressions, but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I've tried:
"((\w+)(\s?))*"
To the best of my knowledge, this should match one or more alphanumerics greedily, followed by either one or no white-space character, and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation (python help('re')) says that *, + and ? match 0 or more, 1 or more, and 0 or 1 (greedy) repetitions of the preceding RE.
In such a case, is the space alone the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the preceding RE for the '*' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc., which are basically variations of the same idea: that the sentence is a set of alphanumerics followed by a single/finite number of white spaces, and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S. I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive white-space characters.
Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundaries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively, you can match the entire sentence via re.match and call group(0) on the resulting match object to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
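The capturing-group issue mentioned at the start of this answer can also be sidestepped with non-capturing groups. This is a sketch with illustrative variable names: when a pattern has capturing groups, re.findall returns the groups (holding only the last repetition) instead of the whole match.

```python
import re

s = "This is a regular sentence."

# With capturing groups, findall returns tuples of the groups' last repetition
with_groups = re.findall(r"((\w+)(\s?))*", s)
# With a non-capturing group, findall returns the whole match
without_groups = re.findall(r"(?:\w+\s?)+", s)

print(with_groups[0])
print(without_groups)
```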
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of consecutive white-space characters? A sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row; rather, a sentence is the area of text that ends with a punctuation mark, or with something else that is not in the above character class, white space included.
([a-zA-Z0-9\s])*
The above regex will match a sentence as a series of alphanumeric characters or spaces, repeated zero or more times. You can refine it to the following, though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with an alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""
# a sentence ends at a period, a newline, or two or more spaces in a row
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    for i, s in enumerate(re_sentence.finditer(source)):
        print("%d:%s" % (i, s.group(0)))

if __name__ == '__main__':
    main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.